* performance issues (nat / conntrack) @ 2002-06-20 19:48 Jean-Michel Hemstedt 2002-06-22 16:51 ` Harald Welte 2002-06-25 10:35 ` Jozsef Kadlecsik 0 siblings, 2 replies; 33+ messages in thread From: Jean-Michel Hemstedt @ 2002-06-20 19:48 UTC (permalink / raw) To: netfilter-devel dear netdevels, I'm doing some tcp benches on a netfilter enabled box and noticed huge and surprising perf decrease when loading iptable_nat module. - ip_conntrack is of course also loading the system, but with huge memory and a large bucket size, the problem can be solved. The big issue with ip_conntrack are the state timeouts: it simply kill the system and drops all the traffic with the default ones, because the ip_conntrack table becomes quickly full, and it seems that there is no way to recover from that situation... Keeping unused entries (time_close) even 1 minute in the cache is really not suitable for configurations handling (relatively) large number of connections/s. o The cumulative effect should be reconsidered. o Are there ways/plans to tune the timeouts dynamically? and what are the valid/invalid ranges of timeouts? o looking at the code, it seems that one timer is started by tuple... wouldn't it be more efficient to have a unique periodic callback scanning the whole or part of the table for aged entries? - The annoying point is iptable_nat: normally the number of entries in the nat table is much lower than the number of entries in the conntrack table. So even if the hash function itself could be less efficient than the ip_conntrack one (because it takes less arguments: src+dst+proto), the load of nat, should be much lower than the load of conntrack. o So... why is it the opposite?? o Are there ways to tune the nat performances? - Another (old) question: why are conntrack or nat active when there are no rules configured (using them or not)? If not fixed it should be at least documented... 
  Somebody doing "iptables -t nat -L" takes the risk of killing their
  system if it's already under load... In the same spirit, iptables -F
  should unload all unused modules (the ip_tables module doesn't hurt).
  Just one quick fix: replace the 'iptables' executable with an
  'iptables' script calling the exe (located somewhere else) and doing
  an rmmod at the end...

comments are welcome;

here is my test bed:

tested target:
 - kernel 2.4.18 + non_local_bind + small conntrack timeouts...
 - PIII ~500MHz, RAM=256MB
 - 2 * 100Mb/s NIC

The target acts as a forwarding gateway between a load generator client
running httperf, and an apache proxy serving cached pages. 100Mb/s NICs
and request/response sizes ensure that bandwidth and packet collisions
are not an issue.

Since in my test each connection is ephemeral (<10ms), I recompiled the
kernel with very short conntrack timeouts (i.e. 1 sec for close_wait,
and about 60 sec for established!). This was also the only way to
restrict the conntrack hash table size (given my RAM) and avoid
exaggerated hash collisions.

Another limitation comes from my load generator creating traffic from
one source to one destination IP address, with only source port
variation (but given my configured hash table size and the hash function
itself it shouldn't have been an issue).

results are averages from procinfo -n10 [d]

test results:

1) target = forwarding only (no iptables module or rule)

   - rate                : 100 conn/s (= request-response/s)
     -> CPU load         : 0% system
     -> context          : 7 context/s
     -> irq(eth0/eth1)   : 0.9 / 0.9 kpps (# of packets/sec = # irq/s)

   - rate                : 500 conn/s
     -> CPU load         : 10% system
     -> context          : 18->100 context/s (varying!)
     -> irq(eth0/eth1)   : 4.4 / 4.4 kpps

   - rate (max)          : 1050 conn/s (max from my load generator)
     -> CPU load         : 25% system
     -> context          : 1000 context/s
     -> irq(eth0/eth1)   : 10 / 10 kpps

2) (1) + insmod ip_conntrack 16384 (no rules)

   - rate                : 100 conn/s
     -> CPU load         : 0.8% system
     -> context          : 7 context/s
     -> irq(eth0/eth1)   : 0.9 / 0.9 kpps
     -> conntrack size   : 970 concurrent entries

   - rate                : 250 conn/s
     -> CPU load         : 10% system
     -> context          : 12 context/s
     -> irq(eth0/eth1)   : 2.2 / 2.2 kpps
     -> conntrack size   : 2390 concurrent entries

   - rate                : 500 conn/s
     -> CPU load         : 30-70% system (varying)
     -> context          : 45-90 context/s
     -> irq(eth0/eth1)   : 4 / 4 kpps
     -> conntrack size   : 4770 concurrent entries

3) (2) + iptables -t nat -L [=iptable_nat] (no rules)

   - rate                : 100 conn/s
     -> CPU load         : 1% system
     -> context          : 8 context/s
     -> irq(eth0/eth1)   : 0.9 / 0.9 kpps
     -> conntrack size   : 970 concurrent entries

   - rate                : 250 conn/s
     -> CPU load         : 40% system
     -> context          : 20 context/s
     -> irq(eth0/eth1)   : 2.2 / 2.2 kpps
     -> conntrack size   : 2390 concurrent entries

   - rate (max)          : 420 conn/s (all failed)
     -> CPU load         : 97% system
     -> context          : 28 context/s
     -> irq(eth0/eth1)   : 3.1 / 4.1 kpps
     -> conntrack size   : 4050 concurrent entries

   - rate (killing)      : [500]->0 conn/s (all failed)
     -> CPU load         : 100% system (no response)
     -> context          : ? context/s
     -> irq(eth0/eth1)   : ? kpps
     -> conntrack size   : 10500??? concurrent entries

other results with active rules (i.e. REDIRECT) depend on the load
generated by the local process handling the traffic, and are thus not
relevant (FYI: max conn/s < 200 with one process handling the REDIRECTed
traffic)

kr,
_______________________________________________________________________
-jmhe-

^ permalink raw reply [flat|nested] 33+ messages in thread
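The figures in test 2 allow a rough cross-check. What follows is a minimal sketch, not part of the original report: it assumes Little's law applies (concurrent entries = arrival rate x average entry lifetime) and copies the numbers straight from the results above. The function and variable names are illustrative only.

```python
# Figures copied from test 2 (ip_conntrack loaded, no rules):
# (new connections per second, concurrent conntrack entries, kpps per NIC)
samples = [
    (100, 970, 0.9),
    (250, 2390, 2.2),
    (500, 4770, 4.0),
]

def entry_lifetime_s(rate, entries):
    """Little's law: average entry lifetime = entries / arrival rate."""
    return entries / rate

def packets_per_connection(rate, kpps):
    """Packets seen on one NIC for each forwarded connection."""
    return kpps * 1000 / rate

for rate, entries, kpps in samples:
    print(rate,
          round(entry_lifetime_s(rate, entries), 2),
          round(packets_per_connection(rate, kpps), 1))
```

Under that assumption each conntrack entry appears to live for roughly 9.5 to 9.7 seconds, far longer than the <10ms connections themselves, which suggests that table occupancy in this test is governed by the timeouts rather than by the traffic.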
* Re: performance issues (nat / conntrack) 2002-06-20 19:48 performance issues (nat / conntrack) Jean-Michel Hemstedt @ 2002-06-22 16:51 ` Harald Welte 2002-06-23 9:15 ` Jean-Michel Hemstedt 2002-06-25 10:35 ` Jozsef Kadlecsik 1 sibling, 1 reply; 33+ messages in thread From: Harald Welte @ 2002-06-22 16:51 UTC (permalink / raw) To: Jean-Michel Hemstedt; +Cc: netfilter-devel [-- Attachment #1: Type: text/plain, Size: 3881 bytes --] On Thu, Jun 20, 2002 at 09:48:27PM +0200, Jean-Michel Hemstedt wrote: > dear netdevels, > > I'm doing some tcp benches on a netfilter enabled box and noticed > huge and surprising perf decrease when loading iptable_nat module. Sounds as expected. > - ip_conntrack is of course also loading the system, but with huge memory > and a large bucket size, the problem can be solved. The big issue with > ip_conntrack are the state timeouts: it simply kill the system and drops > all the traffic with the default ones, because the ip_conntrack table > becomes quickly full, and it seems that there is no way to recover from > that situation... Keeping unused entries (time_close) even 1 minute in > the cache is really not suitable for configurations handling (relatively) > large number of connections/s. what is a 'relatively' large number of connections? I've seen a couple of netfilter firewalls dealing with 200000+ tracked connections. > o The cumulative effect should be reconsidered. could you please try to explain what you mean? > o Are there ways/plans to tune the timeouts dynamically? and what are > the valid/invalid ranges of timeouts? No, see the mailinglist archives for th reason why. > o looking at the code, it seems that one timer is started by tuple... > wouldn't it be more efficient to have a unique periodic callback > scanning the whole or part of the table for aged entries? I think somebody (Martin Josefsson?) 
is currently looking into optimizing > - The annoying point is iptable_nat: normally the number of entries in > the nat table is much lower than the number of entries in the conntrack > table. So even if the hash function itself could be less efficient than > the ip_conntrack one (because it takes less arguments: src+dst+proto), > the load of nat, should be much lower than the load of conntrack. > o So... why is it the opposite?? ? What 'nat table' are you talking about? Do you understand how NAT works and how it interacts with connection tracking? > o Are there ways to tune the nat performances? no. NAT (and esp. NAT performance) is not a very strong point of netfilter. Everybody agrees that NAT is evil and it should be avoided in all circumstances. Rusty didn't want to become NAT/masquerading maintainer in the first place, but rather concentrate on packet filtering. The NAT subsystem has a number of shortcomings, some of which have been fixed, other still remain. > - Another (old) question: why are conntrack or nat active when there are > no rules configured (using them or not)? If not fixed it should be at > least documented... This is standard behaviour. Does your network driver unload if you 'ifconfig down' an interface? Does a TC qdisc module unload if you delete all instances of the queue? conntrack is _not_ related/intermangled with iptables at all. Conntrack does not know if anybody is using conntrack state in the system. > Somebody doing "iptables -t nat -L" takes the risk > of killing its system if it's already under load... ? Please explain why. I see no reason for this. > In the same spirit, > iptables -F should unload all unused modules (the ip_tables modules > doesn't hurt). Just one quick fix: replace the 'iptables' executable by > one 'iptables' script calling the exe (located somewhere else) and > doing an rmmod at the end... no. this is considered a feature. The current [and past] behaviour is wanted like this by design. 
> -jmhe- -- Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ ============================================================================ GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M+ V-- PS++ PE-- Y++ PGP++ t+ 5-- !X !R tv-- b+++ !DI !D G+ e* h--- r++ y+(*) [-- Attachment #2: Type: application/pgp-signature, Size: 232 bytes --] ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack) 2002-06-22 16:51 ` Harald Welte @ 2002-06-23 9:15 ` Jean-Michel Hemstedt 2002-06-25 11:33 ` Jozsef Kadlecsik 0 siblings, 1 reply; 33+ messages in thread From: Jean-Michel Hemstedt @ 2002-06-23 9:15 UTC (permalink / raw) To: Harald Welte; +Cc: netfilter-devel I know this debate is not new... I just didn't expect such a (90% see below) perf drop, and unavailablity risk. That's why I'm only reporting it, hoping secretly that experienced hackers will consider it seriously. ;o) Note: I don't want to play with words, but if you prefer, consider 'load generator' as 'malicious DoS user', and 'perf issue' as 'DoS vulnerability' as Don Cohen cleverly suggested :-/ (for me it's the same problem, except that DoS is ponctual while perf is what we may expect in normal situation) > > > > I'm doing some tcp benches on a netfilter enabled box and noticed > > huge and surprising perf decrease when loading iptable_nat module. > > Sounds as expected. loading a module, doesn't mean using it (lsmod reports it as 'unused' in my tests). So, does it really 'sounds as expected', when you see your cpu load hitting 100%, and most packets dropped just after having done 'iptables -t nat -L' on a system with 1%CPU load handling 'only' 10kpps and forwarding about 1000 new TCP connections/s? > > > - ip_conntrack is of course also loading the system, but with huge memory > > and a large bucket size, the problem can be solved. The big issue with > > ip_conntrack are the state timeouts: it simply kill the system and drops > > all the traffic with the default ones, because the ip_conntrack table > > becomes quickly full, and it seems that there is no way to recover from > > that situation... Keeping unused entries (time_close) even 1 minute in > > the cache is really not suitable for configurations handling (relatively) > > large number of connections/s. > > what is a 'relatively' large number of connections? 
I've seen a couple > of netfilter firewalls dealing with 200000+ tracked connections. 200K concurrent established connections, maybe... but surely not NEW connections/second. See previous results: with only ip_conntrack loaded (no nat), I hardly reached 500 (new) conn/s. > > > o The cumulative effect should be reconsidered. > > could you please try to explain what you mean? There are 3 aspects: - table exhaustion (can be fixed with large memory) as long as the hash is correctly distributed (few collisions) - concurrent timers (1 per conntrack tuple??) - I can't explain the last one, but when the table is exhausted conntrack drops new packets, right? What I noticed is that at that moment, the cpu load suddenly hit 100%, and the machine did not recover, unless I killed the load generator > > > o Are there ways/plans to tune the timeouts dynamically? and what are > > the valid/invalid ranges of timeouts? > > No, see the mailinglist archives for th reason why. If you refer to your mail of 18 January 2001, I think that this timeout should also be reviewed ;o)... Waiting for somebody having the time and being able of doing a redesign was quite idealistic, while a quick patch for configurable timeouts per rule (ie: http timeouts different from smtp ones, as suggested by Denis Ducamp) would have been more realistic. > > > o looking at the code, it seems that one timer is started by tuple... > > wouldn't it be more efficient to have a unique periodic callback > > scanning the whole or part of the table for aged entries? > > I think somebody (Martin Josefsson?) is currently looking into optimizing > > > - The annoying point is iptable_nat: normally the number of entries in > > the nat table is much lower than the number of entries in the conntrack > > table. So even if the hash function itself could be less efficient than > > the ip_conntrack one (because it takes less arguments: src+dst+proto), > > the load of nat, should be much lower than the load of conntrack. 
> > o So... why is it the opposite?? > > ? What 'nat table' are you talking about? Do you understand how NAT > works and how it interacts with connection tracking? Actually, that's also what i would like to know ;o) bysource or byisproto hash tables, pointing to ip_nat_hash tuples pointing to ip_conntrack entry. But i don't understand where the extra processing comes from when there are no (nat) rules defined. Just to recall my test: I generated an amount of new connections per second passing through a forwarding machine without any iptables module and measured the cpu load/responsiveness and other things... Then while the machine was sustaining this amount of new conn/s, i did 'insmod ip_conntrack [size]', saw the cpu load increasing, and finally just did 'iptables -t nat -L' to load the nat module without any rule, and saw again the cpu load increasing. With 500conn/s, the cpu load went from 10% -> ~50/70% -> 100% (machine unavailable). > > > o Are there ways to tune the nat performances? > > no. NAT (and esp. NAT performance) is not a very strong point of netfilter. > Everybody agrees that NAT is evil and it should be avoided in all circumstances. > Rusty didn't want to become NAT/masquerading maintainer in the first place, > but rather concentrate on packet filtering. wow! what is the alternative for 'Everybody' using REDIRECT? > > The NAT subsystem has a number of shortcomings, some of which have been > fixed, other still remain. > > > - Another (old) question: why are conntrack or nat active when there are > > no rules configured (using them or not)? If not fixed it should be at > > least documented... > > This is standard behaviour. Does your network driver unload if you > 'ifconfig down' an interface? Does a TC qdisc module unload if you > delete all instances of the queue? ok, but does your interface sends irq when it is down? I don't care about having an 'unused' module in memory as long as it is doing nothing and not (over)loading the system. 
> > conntrack is _not_ related/intermangled with iptables at all. Conntrack > does not know if anybody is using conntrack state in the system. > > > Somebody doing "iptables -t nat -L" takes the risk > > of killing its system if it's already under load... > > ? Please explain why. I see no reason for this. We agree, i also don't see any reason for it. see above: a 'clean' machine without iptables modules or rule which is handling 500conn/s hit 100%cpu and becomes unavailable if you do 'iptables -t nat -L'. > > > In the same spirit, > > iptables -F should unload all unused modules (the ip_tables modules > > doesn't hurt). Just one quick fix: replace the 'iptables' executable by > > one 'iptables' script calling the exe (located somewhere else) and > > doing an rmmod at the end... > > no. this is considered a feature. The current [and past] behaviour is wanted > like this by design. that's a... choice. > - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ _______________________________________________________________________ -jmhe- He who expects nothing shall never be disappointed ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack) 2002-06-23 9:15 ` Jean-Michel Hemstedt @ 2002-06-25 11:33 ` Jozsef Kadlecsik 2002-06-25 12:47 ` Harald Welte 2002-06-25 13:21 ` Jean-Michel Hemstedt 0 siblings, 2 replies; 33+ messages in thread From: Jozsef Kadlecsik @ 2002-06-25 11:33 UTC (permalink / raw) To: Jean-Michel Hemstedt; +Cc: Harald Welte, netfilter-devel On Sun, 23 Jun 2002, Jean-Michel Hemstedt wrote: > > > I'm doing some tcp benches on a netfilter enabled box and noticed > > > huge and surprising perf decrease when loading iptable_nat module. > > > > Sounds as expected. > > loading a module, doesn't mean using it (lsmod reports it as 'unused' > in my tests). So, does it really 'sounds as expected', when you see >From where do you think that the module usage counter reports how many packets/connections are handled (currently? totally?) by the module. There is no whatsoever connection! > > > o The cumulative effect should be reconsidered. > > - I can't explain the last one, but when the table is exhausted > conntrack drops new packets, right? What I noticed is that at that > moment, the cpu load suddenly hit 100%, and the machine did not > recover, unless I killed the load generator That is unusual and should be tested further. > > ? What 'nat table' are you talking about? Do you understand how NAT > > works and how it interacts with connection tracking? > > Just to recall my test: I generated an amount of new connections > per second passing through a forwarding machine without any iptables > module and measured the cpu load/responsiveness and other things... > Then while the machine was sustaining this amount of new conn/s, i did > 'insmod ip_conntrack [size]', saw the cpu load increasing, and finally > just did 'iptables -t nat -L' to load the nat module without any rule, > and saw again the cpu load increasing. With 500conn/s, the cpu load went > from 10% -> ~50/70% -> 100% (machine unavailable). 
According to your first mail, the machine has 256MB RAM and you issued

    insmod ip_conntrack 16384

That requires 16384*8*~600 bytes ~= 75MB of non-swappable RAM.

When you issued "iptables -t nat -L", the system tried to reserve an
additional 2x75MB. In total that is pretty close to all your available
physical RAM, and the machine might have died swapping.

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary
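The estimate in this message can be reproduced as plain arithmetic. This is only a sketch of the reasoning as stated here (hash size x factor 8 x roughly 600 bytes per entry, then two more tables of the same size for NAT). The 600-byte figure and the NAT multiplier are the message's own assumptions, and replies further down the thread revise them.

```python
HASH_SIZE = 16384          # argument given to "insmod ip_conntrack"
ENTRIES_PER_BUCKET = 8     # the factor 8 used in the message
BYTES_PER_ENTRY = 600      # "~600byte" per the message (approximate)
RAM_MIB = 256              # physical RAM of the test machine

def mib(n_bytes):
    """Convert a byte count to MiB."""
    return n_bytes / (1024 * 1024)

conntrack_mib = mib(HASH_SIZE * ENTRIES_PER_BUCKET * BYTES_PER_ENTRY)
total_mib = 3 * conntrack_mib   # conntrack plus the claimed 2x for NAT
print(conntrack_mib, total_mib, total_mib / RAM_MIB)
```

Taken at face value, the claimed total would be 225 of the 256MB available, which is the basis for the swapping hypothesis.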
* Re: performance issues (nat / conntrack) 2002-06-25 11:33 ` Jozsef Kadlecsik @ 2002-06-25 12:47 ` Harald Welte 2002-06-25 14:23 ` Jozsef Kadlecsik ` (2 more replies) 2002-06-25 13:21 ` Jean-Michel Hemstedt 1 sibling, 3 replies; 33+ messages in thread From: Harald Welte @ 2002-06-25 12:47 UTC (permalink / raw) To: Jozsef Kadlecsik; +Cc: Jean-Michel Hemstedt, netfilter-devel On Tue, Jun 25, 2002 at 01:33:13PM +0200, Jozsef Kadlecsik wrote: > From where do you think that the module usage counter reports how many > packets/connections are handled (currently? totally?) by the module. > There is no whatsoever connection! one should also consider the performance impact this would have !!! > According to your first mail, the machine has 256M RAM and you issued > > insmod ip_conntrack 16384 > > That requires 16384*8*~600byte ~= 75MB non-swappable RAM. > > When you issued "iptables -t nat -L", the system tried to reserve plus > 2x75MB. That's in total pretty near to all your available physical RAM > and the machine might died in swapping. ??? Why should listing an IP table try to reserve twice the size of the conntrack table? > Regards, > Jozsef -- Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ ============================================================================ GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack)
  2002-06-25 12:47 ` Harald Welte
@ 2002-06-25 14:23   ` Jozsef Kadlecsik
  [not found]       ` <025001c21c50$763fa880$0489cb8a@etbx180>
  2002-06-25 21:08   ` Jozsef Kadlecsik
  2 siblings, 0 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 14:23 UTC (permalink / raw)
  To: Harald Welte; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, 25 Jun 2002, Harald Welte wrote:

> > According to your first mail, the machine has 256M RAM and you issued
> >
> > insmod ip_conntrack 16384
> >
> > That requires 16384*8*~600byte ~= 75MB non-swappable RAM.
> >
> > When you issued "iptables -t nat -L", the system tried to reserve plus
> > 2x75MB. That's in total pretty near to all your available physical RAM
> > and the machine might died in swapping.
>
> ??? Why should listing an IP table try to reserve twice the size of the
> conntrack table?

By entering the command above, he loads the iptable_nat kernel module,
which on initialization tries to allocate memory for the bysource and
byipsproto hashes (each with the same size as ip_conntrack_hash).

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary
[parent not found: <025001c21c50$763fa880$0489cb8a@etbx180>]
* Re: performance issues (nat / conntrack) [not found] ` <025001c21c50$763fa880$0489cb8a@etbx180> @ 2002-06-25 16:07 ` Harald Welte 0 siblings, 0 replies; 33+ messages in thread From: Harald Welte @ 2002-06-25 16:07 UTC (permalink / raw) To: Jean-Michel Hemstedt; +Cc: netfilter-devel On Tue, Jun 25, 2002 at 03:59:01PM +0200, Jean-Michel Hemstedt wrote: > > ??? Why should listing an IP table try to reserve twice the size of the > > conntrack table? > > this is in nat_init (or so): nat takes the conntrack hash size to > allocate 2 additional nat hashes 'bysource' and 'byisproto'. Ah. I was not aware that you didn't have iptable_nat loaded before the command. Just issuing the '-L' command with no nat loaded does not allocate anything big inside the kernel. > The question is, why do we init it, if we don't use it (on a rule > point of view)? This init step should occur only if we insert a rule > using nat. no. This is again something I regard as feature, not as bug. dont load the module if you don't use it. It's the same behaviour like conntrack. > kr, > -jmhe- -- Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ ============================================================================ GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack) 2002-06-25 12:47 ` Harald Welte 2002-06-25 14:23 ` Jozsef Kadlecsik [not found] ` <025001c21c50$763fa880$0489cb8a@etbx180> @ 2002-06-25 21:08 ` Jozsef Kadlecsik 2 siblings, 0 replies; 33+ messages in thread From: Jozsef Kadlecsik @ 2002-06-25 21:08 UTC (permalink / raw) To: Harald Welte; +Cc: Jean-Michel Hemstedt, netfilter-devel On Tue, 25 Jun 2002, Harald Welte wrote: > > According to your first mail, the machine has 256M RAM and you issued > > > > insmod ip_conntrack 16384 > > > > That requires 16384*8*~600byte ~= 75MB non-swappable RAM. > > > > When you issued "iptables -t nat -L", the system tried to reserve plus > > 2x75MB. That's in total pretty near to all your available physical RAM > > and the machine might died in swapping. > > ??? Why should listing an IP table try to reserve twice the size of the > conntrack table? Harald, Bazsi, of course you are totally right and I wrote bullshit above. Of course loading iptable_nat does *not* involve a memory requirement comparable to ip_conntrack. Hash element sizes are equal but structure sizes in the hashes are far from each other. Sorry for spreading false information. :-( Regards, Jozsef - E-mail : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu WWW-Home: http://www.kfki.hu/~kadlec Address : KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack) 2002-06-25 11:33 ` Jozsef Kadlecsik 2002-06-25 12:47 ` Harald Welte @ 2002-06-25 13:21 ` Jean-Michel Hemstedt 2002-06-25 13:51 ` Harald Welte ` (2 more replies) 1 sibling, 3 replies; 33+ messages in thread From: Jean-Michel Hemstedt @ 2002-06-25 13:21 UTC (permalink / raw) To: Jozsef Kadlecsik; +Cc: Harald Welte, netfilter-devel > > loading a module, doesn't mean using it (lsmod reports it as 'unused' > > in my tests). So, does it really 'sounds as expected', when you see > > From where do you think that the module usage counter reports how many > packets/connections are handled (currently? totally?) by the module. > There is no whatsoever connection! module usage counter increases when a TARGET needs it (i.e. ipt_REDIRECT). In this test, no rule was defined, and no target module was loaded. So I did not expect NAT to process any packet. > > > > > o The cumulative effect should be reconsidered. > > > > - I can't explain the last one, but when the table is exhausted > > conntrack drops new packets, right? What I noticed is that at that > > moment, the cpu load suddenly hit 100%, and the machine did not > > recover, unless I killed the load generator > > That is unusual and should be tested further. I suppose that due to the load, packets are dropped not because of conntrack but because they simply can't be processed, and thus conntrack misses packets of existing connections (such as FIN, RST) and can't thus recover due to its timeouts. > > > > ? What 'nat table' are you talking about? Do you understand how NAT > > > works and how it interacts with connection tracking? > > > > Just to recall my test: I generated an amount of new connections > > per second passing through a forwarding machine without any iptables > > module and measured the cpu load/responsiveness and other things... 
> > Then while the machine was sustaining this amount of new conn/s, i did > > 'insmod ip_conntrack [size]', saw the cpu load increasing, and finally > > just did 'iptables -t nat -L' to load the nat module without any rule, > > and saw again the cpu load increasing. With 500conn/s, the cpu load went > > from 10% -> ~50/70% -> 100% (machine unavailable). > > According to your first mail, the machine has 256M RAM and you issued > > insmod ip_conntrack 16384 > > That requires 16384*8*~600byte ~= 75MB non-swappable RAM. > > When you issued "iptables -t nat -L", the system tried to reserve plus > 2x75MB. That's in total pretty near to all your available physical RAM > and the machine might died in swapping. > exact! That's why I looked (but not closely) at swap-in/swap-out in procinfo, but didn't notice anything (0 most of the time on 10 sec average). But I agree that I was close to the limit, and even over when I tried 32K. Despite that, nothing so surpising to have so few swaps, since my table was not full (max 4000 up to 10000 concurrent tuples). But this raises one additional problem: 1) the hash index size and the hash total size should be configurable separately (get rid of that factor 8, and use a free list for the tuple allocation). 2) NAT hash sizes should also be configurable independently from conntrack. Normally the nat hashes are smaller than conntrack hash, since conntrack is based on ports, while nat is not. PS: could anybody redo similar tests so that we can compare the results and stop killing the messenger, please? ;o) > Regards, > Jozsef > - > E-mail : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu > WWW-Home: http://www.kfki.hu/~kadlec > Address : KFKI Research Institute for Particle and Nuclear Physics > H-1525 Budapest 114, POB. 49, Hungary > > > > ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack) 2002-06-25 13:21 ` Jean-Michel Hemstedt @ 2002-06-25 13:51 ` Harald Welte 2002-06-25 14:33 ` Jozsef Kadlecsik 2002-06-25 14:51 ` Jean-Michel Hemstedt 2002-06-25 13:52 ` Patrick Schaaf 2002-06-25 14:53 ` Jozsef Kadlecsik 2 siblings, 2 replies; 33+ messages in thread From: Harald Welte @ 2002-06-25 13:51 UTC (permalink / raw) To: Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel On Tue, Jun 25, 2002 at 03:21:56PM +0200, Jean-Michel Hemstedt wrote: > > > loading a module, doesn't mean using it (lsmod reports it as 'unused' > > > in my tests). So, does it really 'sounds as expected', when you see > > > > From where do you think that the module usage counter reports how many > > packets/connections are handled (currently? totally?) by the module. > > There is no whatsoever connection! > > module usage counter increases when a TARGET needs it (i.e. ipt_REDIRECT). > In this test, no rule was defined, and no target module was loaded. > So I did not expect NAT to process any packet. the way NAT is implemented currently, it always processes every packet the same way. For a NEW packet where we don't find a nat rule, we allocate a 'null binding' telling the nat code that there is no nat transformation to be made . > But this raises one additional problem: > 1) the hash index size and the hash total size should be configurable > separately (get rid of that factor 8, and use a free list for the tuple > allocation). > 2) NAT hash sizes should also be configurable independently from conntrack. > Normally the nat hashes are smaller than conntrack hash, since conntrack > is based on ports, while nat is not. both of this is already true. 
look at the module loadtime parameters of ip_conntrack.o and iptable_nat.o -- Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ ============================================================================ GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*) ^ permalink raw reply [flat|nested] 33+ messages in thread
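The 'null binding' behaviour described in this message can be pictured with a toy model. This is not the kernel code: `Connection`, `find_rule` and `setup_binding` are hypothetical names chosen for illustration. The only point is that a binding is set up for every NEW connection, whether or not any NAT rule matches.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Addr = Tuple[str, int]  # (ip, port)

@dataclass
class Connection:
    src: Addr
    dst: Addr
    binding: Optional[Tuple[Addr, Addr]] = None  # translated (src, dst)

# A rule returns a translated (src, dst) pair, or None if it does not match.
Rule = Callable[[Connection], Optional[Tuple[Addr, Addr]]]

def find_rule(conn: Connection, rules: List[Rule]):
    """Return the translation of the first matching rule, or None."""
    for rule in rules:
        result = rule(conn)
        if result is not None:
            return result
    return None

def setup_binding(conn: Connection, rules: List[Rule]) -> Connection:
    """Runs once per NEW connection, even when `rules` is empty."""
    translated = find_rule(conn, rules)
    if translated is None:
        # Null binding: record that no translation is to be made.
        translated = (conn.src, conn.dst)
    conn.binding = translated
    return conn

conn = setup_binding(Connection(("10.0.0.1", 40000), ("10.0.0.2", 80)), rules=[])
print(conn.binding)
```

In this model the per-connection work of looking for a rule and recording a binding happens even with an empty rule list, which is one way to read the extra load reported with iptable_nat loaded and no rules defined.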
* Re: performance issues (nat / conntrack)
  2002-06-25 13:51 ` Harald Welte
@ 2002-06-25 14:33   ` Jozsef Kadlecsik
  2002-06-25 14:51   ` Jean-Michel Hemstedt
  1 sibling, 0 replies; 33+ messages in thread
From: Jozsef Kadlecsik @ 2002-06-25 14:33 UTC (permalink / raw)
  To: Harald Welte; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, 25 Jun 2002, Harald Welte wrote:

> > But this raises one additional problem:
> > 1) the hash index size and the hash total size should be configurable
> > separately (get rid of that factor 8, and use a free list for the tuple
> > allocation).
> > 2) NAT hash sizes should also be configurable independently from conntrack.
> > Normally the nat hashes are smaller than conntrack hash, since conntrack
> > is based on ports, while nat is not.
>
> both of this is already true. look at the module loadtime parameters of
> ip_conntrack.o and iptable_nat.o

One must set hashsize for the ip_conntrack module and then tweak
/proc/sys/net/ipv4/ip_conntrack_max in order to get rid of the factor 8.

But we do not yet have a module parameter for setting the hash sizes of
iptable_nat independently.

Regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu
WWW-Home: http://www.kfki.hu/~kadlec
Address : KFKI Research Institute for Particle and Nuclear Physics
          H-1525 Budapest 114, POB. 49, Hungary

* Re: performance issues (nat / conntrack)
From: Jean-Michel Hemstedt @ 2002-06-25 14:51 UTC
To: Harald Welte; +Cc: Jozsef Kadlecsik, netfilter-devel

> > But this raises one additional problem:
> > 1) the hash index size and the hash total size should be configurable
> > separately (get rid of that factor 8, and use a free list for the tuple
> > allocation).
> > 2) NAT hash sizes should also be configurable independently from conntrack.
> > Normally the nat hashes are smaller than conntrack hash, since conntrack
> > is based on ports, while nat is not.
>
> both of this is already true. look at the module loadtime parameters of
> ip_conntrack.o and iptable_nat.o

Right for conntrack, but I can't find anything similar for nat:

conntrack:
----------
- ip_conntrack_htable_size
  : load-time parameter
  : allocated at init
  : 16? bytes per list head
- ip_conntrack_max
  : /proc setting, available only after the module is loaded
  : tuples allocated on demand (kmem_cache_alloc)
  : 392 bytes per tuple
  => that's why I'm not swapping when my table is not full...

but in ip_conntrack_init() (line 1093):
    ip_conntrack_max = 8 * ip_conntrack_htable_size;
=> when the module is loaded, it is loaded with this default value.
   It would be good to have it as a load-time parameter so that it can
   be saved and restored in modules.conf.

nat:
----
(from ip_nat_init):
- ip_nat_htable_size = ip_conntrack_htable_size; (not configurable)
  : allocated at init twice (for the bysource and byipsproto hashes)
- max tuples???
  I haven't found any value or any config data (is it in
  patch-o-matic?), but the tuples are allocated on demand.

PS: the fact that tuples are allocated on demand (392 bytes/tuple) and not
at init also explains why I was not swapping. (just facts ;o))

* Re: performance issues (nat / conntrack)
From: Harald Welte @ 2002-06-25 16:11 UTC
To: Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel

On Tue, Jun 25, 2002 at 04:51:37PM +0200, Jean-Michel Hemstedt wrote:
> > both of this is already true. look at the module loadtime parameters of
> > ip_conntrack.o and iptable_nat.o
>
> right for conntrack, but i can't find something similar for nat:

Strange. I thought we already had that.

> conntrack:
> ----------
>
> but in ip_conntrack_init():
> 1093 ip_conntrack_max = 8 * ip_conntrack_htable_size;
> => when the module is loaded, it is loaded with this default value.
> could be good to have it as loadable parameter in order to
> save it and restore in modules.conf

Where's the problem with having an
'echo 12345 > /proc/sys/net/ipv4/ip_conntrack_max' in the post-load
script in modules.conf?

> nat:
> ----
> (from ip_nat_init):
> - ip_nat_htable_size = ip_conntrack_htable_size; (not configurable)
> : allocated at init twice
> (for bysource and byisproto hashes)
> - max tuples??? haven't found any value neither any config data.
> (is it in patch-o-matic)?
> but the tuples are allocated on demand.
>
>
> PS: the fact that tuples are allocated on demand (392bytes/tuple) and not at
> init

A tuple is not 392 bytes. That sounds more like the size of a struct
ip_conntrack.

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/

* Re: performance issues (nat / conntrack)
From: Patrick Schaaf @ 2002-06-25 13:52 UTC
To: Jean-Michel Hemstedt; +Cc: netfilter-devel

Jean-Michel,

> PS: could anybody redo similar tests so that we can compare the results
> and stop killing the messenger, please? ;o)

Just so you don't get the wrong impression: I am not trying to shoot the
messenger, I'm trying to shoot incomplete messages. Please, don't become
discouraged in further investigating the situation!

best regards
  Patrick

* Re: performance issues (nat / conntrack)
From: Jozsef Kadlecsik @ 2002-06-25 14:53 UTC
To: Jean-Michel Hemstedt; +Cc: Harald Welte, netfilter-devel

On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:

> > From where do you think that the module usage counter reports how many
> > packets/connections are handled (currently? totally?) by the module.
> > There is no whatsoever connection!
>
> module usage counter increases when a TARGET needs it (i.e. ipt_REDIRECT).

Yes, this is true for netfilter target/match modules. But even in that
case, the number refers to how many rules use the module and not to how
many packets were processed.

> In this test, no rule was defined, and no target module was loaded.

There is always an implicit rule in the case of NAT. Being an implicit
rule, it is not counted in the module usage counter. :-)

> So I did not expect NAT to process any packet.

No, NAT always processes all packets, the same way as conntrack does.

> I suppose that due to the load, packets are dropped not because of conntrack
> but because they simply can't be processed, and thus conntrack misses packets
> of existing connections (such as FIN, RST) and can't thus recover due to its
> timeouts.

If conntrack missed packets in such a way, then the destination would miss
them as well and the sender would have to resend them. No problem.

> > When you issued "iptables -t nat -L", the system tried to reserve plus
> > 2x75MB. That's in total pretty near to all your available physical RAM
> > and the machine might died in swapping.
>
> exact!
> That's why I looked (but not closely) at swap-in/swap-out in procinfo,
> but didn't notice anything (0 most of the time on 10 sec average).
> But I agree that I was close to the limit, and even over when I tried 32K.
> Despite that, nothing so surpising to have so few swaps, since my table
> was not full (max 4000 up to 10000 concurrent tuples).

But the whole space gets reserved! Immediately, as soon as the module is
loaded! And it is non-swappable RAM; everything else would get the rest.

> PS: could anybody redo similar tests so that we can compare the results
> and stop killing the messenger, please? ;o)

Sorry if I look harsh, it's not my intention at all. We have simply been
over almost exactly the same arguments several times. And those resulted
neither in pinpointing real flaws in the system, nor in better algorithms.

Regards,
Jozsef

* Re: performance issues (nat / conntrack)
From: Balazs Scheidler @ 2002-06-25 15:22 UTC
To: Jozsef Kadlecsik; +Cc: Jean-Michel Hemstedt, Harald Welte, netfilter-devel

On Tue, Jun 25, 2002 at 04:53:33PM +0200, Jozsef Kadlecsik wrote:
> On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:
> > PS: could anybody redo similar tests so that we can compare the results
> > and stop killing the messenger, please? ;o)
>
> Sorry if I look harsh, it's not my intention at all. We were simply over
> almost exaclty the same arguments several times. And those resulted
> neither pinpointing real flaws in the system, nor better algorithms.

No, the whole space is not reserved at load time: only the head pointers
for the hashes are preallocated. The conntrack structures themselves are
allocated by the slab allocator: kmem_cache_alloc() is called in
init_conntrack(), which initializes a single conntrack entry.

So the initial memory allocations for conntrack and nat are:

conntrack: htable_size * 8 (8 is sizeof(list_head))
nat:       2 * htable_size * 8

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
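
The figures quoted so far (8 bytes per list head, 392 bytes per conntrack
entry, and the default ip_conntrack_max = 8 * ip_conntrack_htable_size) can
be combined into a rough estimate. The following is only a sketch in
user-space C; the function names are made up for illustration, and the
constants are the ones stated by the posters, not values read from a
running kernel.

```c
/* Figures taken from the thread (2.4.18 on 32-bit x86). They are
 * assumptions of this sketch, not values queried from a kernel. */
#define LIST_HEAD_BYTES  8UL    /* sizeof(struct list_head) */
#define CONNTRACK_BYTES  392UL  /* size quoted for one conntrack entry */
#define MAX_FACTOR       8UL    /* ip_conntrack_max = 8 * htable_size */

/* Bytes allocated when the modules are loaded: one bucket array for
 * conntrack, plus two more of the same size if NAT is loaded. */
unsigned long bucket_bytes(unsigned long htable_size, int nat_loaded)
{
    unsigned long arrays = nat_loaded ? 3UL : 1UL;

    return arrays * htable_size * LIST_HEAD_BYTES;
}

/* Bytes of on-demand (slab) allocations once the table holds the
 * default maximum number of entries. */
unsigned long full_table_bytes(unsigned long htable_size)
{
    return MAX_FACTOR * htable_size * CONNTRACK_BYTES;
}
```

For the 32768-bucket table mentioned elsewhere in the thread,
bucket_bytes(32768, 1) gives 786432 bytes (768 KiB) at load time, while
full_table_bytes(32768) gives 102760448 bytes (98 MiB), which is only
consumed as connections are actually tracked.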

* Re: performance issues (nat / conntrack)
From: Jozsef Kadlecsik @ 2002-06-25 10:35 UTC
To: Jean-Michel Hemstedt; +Cc: netfilter-devel

On Thu, 20 Jun 2002, Jean-Michel Hemstedt wrote:

> I'm doing some tcp benches on a netfilter enabled box and noticed
> huge and surprising perf decrease when loading iptable_nat module.
>
> - ip_conntrack is of course also loading the system, but with huge memory
> and a large bucket size, the problem can be solved. The big issue with
> ip_conntrack are the state timeouts: it simply kill the system and drops
> all the traffic with the default ones, because the ip_conntrack table
> becomes quickly full, and it seems that there is no way to recover from
> that situation... Keeping unused entries (time_close) even 1 minute in
> the cache is really not suitable for configurations handling (relatively)
> large number of connections/s.

Please note: the role of the conntrack subsystem is to keep track of the
connections. As well as possible. If the conntrack table becomes full,
there are two possibilities:

- conntrack table size is underestimated for the real traffic flowing
  through. Get more RAM and increase the table size.
- conntrack is under a (DoS) attack. Then protect conntrack by appropriate
  rules using the recent/limit/psd etc. modules.

I'm against changing the *default* timeout values, except when it is
based on real-life, well-established cases.

> o The cumulative effect should be reconsidered.
> o Are there ways/plans to tune the timeouts dynamically? and what are
> the valid/invalid ranges of timeouts?

There is already a patch in p-o-m which makes it possible to *tune* the
timeouts dynamically via /proc. Actually, the only reason why that part of
the patch was written was to make it possible to dynamically *increase*
the timeout value of the close_wait state.

> - The annoying point is iptable_nat: normally the number of entries in
> the nat table is much lower than the number of entries in the conntrack
> table. So even if the hash function itself could be less efficient than
> the ip_conntrack one (because it takes less arguments: src+dst+proto),
> the load of nat, should be much lower than the load of conntrack.

If there is no explicit NAT rule for a connection, then automatic NULL
mapping happens. (Also, because NAT keeps two additional hashes, the total
amount of memory required for the data is 3*ip_conntrack_htable_size.)

The book-keeping overhead is at least doubled compared to the
conntrack-only case - this explains pretty well the results you got.

> - Another (old) question: why are conntrack or nat active when there are
> no rules configured (using them or not)? If not fixed it should be at
> least documented... Somebody doing "iptables -t nat -L" takes the risk

conntrack and nat are subsystems. If somebody loads them in, then they
start to work.

But why would anyone type in "iptables -t nat -L" when in reality he/she
does not use nat and the nat table itself??

> here is my test bed:
>
> tested target:
> -kernel 2.4.18 + non_local_bind + small conntrack timeouts...
> -PIII~500MHz, RAM=256MB
> -2*100Mb/s NIC
>
> The target acts as a forwarding gateway between a load generator client
> running httperf, and an apache proxy serving cached pages. 100Mb/s NICs
> and requests/response sizes insure that BW and packet collisions is not
> an issue.
>
> Since in my test, each connection is ephemeral (<10ms), i recompiled the
> kernel with very short conntrack timeouts (i.e: 1 sec for close_wait,
> and about 60 sec for established!) This was also the only way to restrict
> the conntrack hash table size (given my RAM) and avoid exagerated hash
> collisions. Another limitation comes from my load generator creating traffic
> from one source to one destination ipa, with only source port variation
> (but given my configured hash table size and the hash function itself
> it shouldn't have been an issue).

I think that because only the source port varies, this is an important
issue in your setup. You actually tested the hash functions and could bomb
some hash entries. The overall effect was a DoS against conntrack.

Regards,
Jozsef

* Re: performance issues (nat / conntrack)
From: Jean-Michel Hemstedt @ 2002-06-25 12:42 UTC
To: Jozsef Kadlecsik; +Cc: netfilter-devel

> > I'm doing some tcp benches on a netfilter enabled box and noticed
> > huge and surprising perf decrease when loading iptable_nat module.
> >
> > - ip_conntrack is of course also loading the system, but with huge memory
> > and a large bucket size, the problem can be solved. The big issue with
> > ip_conntrack are the state timeouts: it simply kill the system and drops
> > all the traffic with the default ones, because the ip_conntrack table
> > becomes quickly full, and it seems that there is no way to recover from
> > that situation... Keeping unused entries (time_close) even 1 minute in
> > the cache is really not suitable for configurations handling (relatively)
> > large number of connections/s.
>
> Please note: the role of the conntrack subsystem is to keep track of the
> connections. As good as possible. If the conntrack table becomes full,
> there are two possibilities:
>
> - conntrack table size is underestimated for the real traffic flowing
> trough. Get more RAM and increase the table size.
> - conntrack is under a (DoS) attack. Then protect conntrack by appropriate
> rules using the recent/limit/psd etc modules.

And what if, under load conditions, your table becomes full because 90% of
its entries, which are unused, are not aged because of timeouts?

We don't even need to have a full table to get into trouble. If at one
point the vast majority of the conntrack entries are unused but still
in the hash, then you get more and more collisions, which decreases the
hash efficiency.

There's another side effect: when the system gets loaded (because of
hash exhaustion or hash collisions), it can't process all arriving packets,
which means that conntrack will not see some of the FIN or RST packets that
would allow it to recover... This is a kind of 'vicious circle', or point
of failure.

In my opinion, a first step should be to reconsider timeout values but
also timer mechanisms.

>
> I'm against in changing the *default* timeout values, except when it is
> based on real-life, well established cases.

Which sounds the most significant: 'TCP timeouts' or 'application timeouts'?
Should (e.g.) HTTP, FTP and Telnet have the same lifetime in the hash?

>
> > o The cumulative effect should be reconsidered.
> > o Are there ways/plans to tune the timeouts dynamically? and what are
> > the valid/invalid ranges of timeouts?
>
> There is already a patch in p-o-m which makes possible to *tune* the
> timeouts dynamically via /proc. Actually, the only reason why that part of
> the patch was written was to make possible to dynamically *increase* the
> timeout value of the close_wait state.

I didn't know that. Thanks for the info. But unfortunately it doesn't meet
my 'timeout per protocol' needs.

>
> > - The annoying point is iptable_nat: normally the number of entries in
> > the nat table is much lower than the number of entries in the conntrack
> > table. So even if the hash function itself could be less efficient than
> > the ip_conntrack one (because it takes less arguments: src+dst+proto),
> > the load of nat, should be much lower than the load of conntrack.
>
> If there is no explicit NAT rule for a connection, then automatic NULL
> mapping happens. (Also, because NAT keeps two additional hashes, the total
> amount of memory required for the data is 3*ip_conntrack_htable_size.)

Indeed, this dimensioning is quite conservative, and it assumes that
conntrack is distributed on src+dst+proto, not on ports. But we can live
with that, since it's only a memory overhead (unless we start considering
memory page swapping).

>
> The book-keeping overhead is at least doubled compared to the
> conntrack-only case - this explains pretty well the results you got.

What do you mean by 'book-keeping'?
Does NAT do a lookup even if there are no rules?

>
> > - Another (old) question: why are conntrack or nat active when there are
> > no rules configured (using them or not)? If not fixed it should be at
> > least documented... Somebody doing "iptables -t nat -L" takes the risk
>
> conntrack and nat are subsystems. If somebody loads them in, then they
> start to work.
>

Work on what, since NAT has nothing to translate?

> But why would anyone type in "iptables -t nat -L" when in reality he/she
> does not use nat and the nat table itself??

(why do we live if it's for dying in the end?)

>
> > here is my test bed:
> >
> > tested target:
> > -kernel 2.4.18 + non_local_bind + small conntrack timeouts...
> > -PIII~500MHz, RAM=256MB
> > -2*100Mb/s NIC
> >
> > The target acts as a forwarding gateway between a load generator client
> > running httperf, and an apache proxy serving cached pages. 100Mb/s NICs
> > and requests/response sizes insure that BW and packet collisions is not
> > an issue.
> >
> > Since in my test, each connection is ephemeral (<10ms), i recompiled the
> > kernel with very short conntrack timeouts (i.e: 1 sec for close_wait,
> > and about 60 sec for established!) This was also the only way to restrict
> > the conntrack hash table size (given my RAM) and avoid exagerated hash
> > collisions. Another limitation comes from my load generator creating traffic
> > from one source to one destination ipa, with only source port variation
> > (but given my configured hash table size and the hash function itself
> > it shouldn't have been an issue).
>
> I think because only the source port varies, this is an important issue in
> your setup.
> You actually tested the hash functions and could bomb some
> hash entries. The overall effect was a DoS against conntrack.

ok, here we go:

    static inline u_int32_t
    hash_conntrack(const struct ip_conntrack_tuple *tuple)
    {
    #if 0
            dump_tuple(tuple);
    #endif
            /* ntohl because more differences in low bits. */
            /* To ensure that halves of the same connection don't hash
               clash, we add the source per-proto again. */
            return (ntohl(tuple->src.ip + tuple->dst.ip
                         + tuple->src.u.all + tuple->dst.u.all
                         + tuple->dst.protonum)
                    + ntohs(tuple->src.u.all))
                    % ip_conntrack_htable_size;
    }

src.u.all & dst.u.all refer (unless there's a bug) to src.tcp.port and
dst.tcp.port respectively.

So, if only src.port varies linearly (let's say between 32000 and 64000),
and if ip_conntrack_htable_size = 32768
(kernel: ip_conntrack (32768 buckets, 262144 max)), then we should have
a maximum of 2 collisions per bucket (unless there's a type overflow
somewhere).

This was my test setup, but since I haven't verified the conntrack hash
distribution, I didn't want to argue on that. To measure that, we should
maintain hash counters such as max collisions, average collisions per
key, hit/miss depth average, number of hit/miss per second, etc...
I've planned to do that along with profiling, but unfortunately not in
the 2 coming weeks.

--
last points I wanted to clarify:

> From: "Patrick Schaaf" <bof@bof.de>
> On Sun, Jun 23, 2002 at 09:46:29PM -0700, Don Cohen wrote:
> > > From: "Jean-Michel Hemstedt" <jean-michel.hemstedt@alcatel.be>
> >
> > > Since in my test, each connection is ephemeral (<10ms) ...
> >
> > One question here is whether the traffic generator is acting like
> > a real set of users or like an attacker. A real user would not keep
> > trying to make connections at the same rate if the previous attempts
> > were not being served. I suspect you're acting more like an attacker.
>
> He definitely is.
> The test he described is completely artificial, and does
> not represent any normal real world workload.
>
> Nevertheless, it does point out a valid optimization chance. We discussed
> that months ago, and it's still there.

No, I don't think so.

1) The hash is not the cause (see above). (btw, as discussed in
   'connection tracking scaling' [19 March 2002], I don't see ways to
   really optimize it unless you go for the multidimensional hashes
   described in theoretical papers, or unless you make traffic assumptions,
   which is most likely impossible in such a generic framework...)
   However, I don't understand why we are adding the src.port twice in
   the hash function?

2) My test was artificial, but not unrealistic: one endpoint sustaining
   1000 conn/s whatever the responsiveness of the target, or 10000 users
   trying to connect through the gw in a time lapse of 10 seconds, is
   similar.
   Now, if some of you are telling me that I'm not allowed, or that I'm
   nuts, to place my box in front of 10000 users, that's another debate.
   I'm not talking about dimensioning, I'm talking about relative
   performance, and strange weaknesses.

kr,
-jmhe-
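
The bucket estimate in the message above can be checked outside the kernel.
What follows is a minimal user-space sketch, not the kernel code: the struct,
the helper names and the fixed addresses are hypothetical, src.u.all is
assumed to be a 16-bit port in network byte order, and the byte swaps are
written out by hand so that the model always behaves like ntohl()/ntohs() on
a little-endian machine.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hand-written byte swaps: they model ntohl()/ntohs() on a
 * little-endian host regardless of the machine running this. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* Simplified stand-in for struct ip_conntrack_tuple. All fields are
 * kept in host order and converted inside model_hash(). */
struct model_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint16_t protonum;
};

/* Same arithmetic as the quoted hash_conntrack(), under the
 * assumptions stated above. */
uint32_t model_hash(const struct model_tuple *t, uint32_t htable_size)
{
    uint32_t sum = swap32(t->src_ip) + swap32(t->dst_ip)
                 + swap16(t->src_port) + swap16(t->dst_port)
                 + t->protonum;

    return (swap32(sum) + t->src_port) % htable_size;
}

/* Longest bucket chain when only the source port varies over
 * [first, last]; every other field is an arbitrary fixed value. */
unsigned longest_chain(uint16_t first, uint16_t last, uint32_t htable_size)
{
    struct model_tuple t = { 0x0a000001u, 0x0a000102u, 0, 80, 6 };
    unsigned *count = calloc(htable_size, sizeof *count);
    unsigned longest = 0;
    uint32_t port;

    if (count == NULL)
        return 0;
    for (port = first; port <= last; port++) {
        uint32_t h;

        t.src_port = (uint16_t)port;
        h = model_hash(&t, htable_size);
        if (++count[h] > longest)
            longest = count[h];
    }
    free(count);
    return longest;
}
```

In this model, longest_chain(32000, 63999, 32768) stays at or below 2, which
agrees with the estimate given in the message. The same helper could be
extended with the other counters proposed there (average chain length, number
of empty buckets) to compare port ranges or table sizes.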

* Re: performance issues (nat / conntrack)
From: Patrick Schaaf @ 2002-06-25 13:50 UTC
To: Jean-Michel Hemstedt; +Cc: netfilter-devel

> In my opinion, a first step should be to reconsider timeout values but
> also timer mechanisms.

No. A first step MUST be pointing out that the current timeouts become
a problem in REAL LIFE. Right now you are speculating. On all setups
I personally know, the timeouts are NOT a problem.

Regarding timer _mechanisms_ I have seen no indication at all that the
current mechanism is a problem. If you want to insist, _please_ learn
about kernel profiling, and start posting FACT.

regards
  Patrick

* Re: performance issues (nat / conntrack)
From: Harald Welte @ 2002-06-25 19:03 UTC
To: Patrick Schaaf; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, Jun 25, 2002 at 03:50:38PM +0200, Patrick Schaaf wrote:
> > In my opinion, a first step should be to reconsider timeout values but
> > also timer mechanisms.
>
> No. A first step MUST be pointing out that the current timeouts become
> a problem in REAL LIFE. Right now you are speculating. On all setups
> I personally know, the timeouts are NOT a problem.
>
> Regarding timer _mechanisms_ I have seen no indication at all that the
> current mechanism is a problem. If you want to insist, _please_ learn
> about kernel profiling, and start posting FACT.

I've been talking about this with a couple of people here at the kernel
summit, and it looks like the per-packet del_timer/add_timer in
ip_ct_refresh should be a severe performance hit on SMP boxes.

Changing this to 'do not update timer if update would be < HZ different
than current timer' is a two-line patch.

As stated before, I'm currently away from my testing equipment, so if
anybody wants to give it a try...

> regards
> Patrick

-- 
Live long and prosper
- Harald Welte / laforge@gnumonks.org               http://www.gnumonks.org/
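
The 'skip the update if it would move the timer by less than HZ' idea can be
modelled in a few lines. This is a user-space sketch only, not the patch
discussed above: the struct and function names are hypothetical, HZ is
assumed to be 100 ticks per second, and tick counter wrap-around is ignored.

```c
/* Assumed tick rate for this model (ticks per second). */
#define HZ 100UL

/* Hypothetical stand-in for the timer part of a conntrack entry. */
struct model_conntrack {
    unsigned long expires;  /* tick at which the entry times out */
    unsigned long rearms;   /* how many times the timer was re-armed */
};

/* Refresh the timeout on packet arrival. Returns 1 if the timer was
 * re-armed, 0 if the update was skipped because it would change the
 * expiry by less than one second. Wrap-around is not handled. */
int model_refresh(struct model_conntrack *ct, unsigned long now,
                  unsigned long extra_ticks)
{
    unsigned long new_expires = now + extra_ticks;
    unsigned long diff = new_expires > ct->expires
                       ? new_expires - ct->expires
                       : ct->expires - new_expires;

    if (diff < HZ)
        return 0;

    /* This is the point where a kernel implementation would pay for
     * the del_timer()/add_timer() pair. */
    ct->expires = new_expires;
    ct->rearms++;
    return 1;
}
```

With this rule a burst of packets on one connection re-arms the timer at most
about once per second instead of once per packet, at the price of an expiry
time that may be up to one second earlier than the exact value.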

* Re: performance issues (nat / conntrack)
From: Alex Bennee @ 2002-06-25 13:56 UTC
To: netfilter-devel

Jean-Michel Hemstedt said:
> In my opinion, a first step should be to reconsider timeout values but
> also timer mechanisms.

I've been following this thread with interest as I recently also had
conntrack related problems (failing to establish new connections due to
the table being full).

My machine is resource constrained (28M RAM) as it's only an ADSL gateway,
yet when I count the number of connections it's tracking, the figure varies
between 300 and 600 connections, which bears little relation to what it
should be. I exacerbate the problem by running gtk-gnutella, which
entertains a lot of short-lived incoming connections that get closed by
the application but still create long-lived conntrack entries.

>> I'm against in changing the *default* timeout values, except when it
>> is based on real-life, well established cases.
>
> What sounds the most significant: 'TCP timeouts' or 'application
> timeouts'? Should (i.e) HTTP, FTP and Telnet have the same lifetime in
> hash?

Maybe an iptables marking approach (a la tc)?

Alex
www.bennee.com/~alex/

* Re: performance issues (nat / conntrack)
From: Jozsef Kadlecsik @ 2002-06-25 14:17 UTC
To: Jean-Michel Hemstedt; +Cc: netfilter-devel

On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:

> > connections. As good as possible. If the conntrack table becomes full,
> > there are two possibilities:
> >
> > - conntrack table size is underestimated for the real traffic flowing
> > trough. Get more RAM and increase the table size.
> > - conntrack is under a (DoS) attack. Then protect conntrack by appropriate
> > rules using the recent/limit/psd etc modules.
>
> And what if, under load conditions, your table becomes full because 90% of
> its entries, which are unused, are not aged because of timeouts?

The only case when that might happen is a DoS. You did not consider the
second point above.

> We don't even need to have a full table to get into troubles. If at one
> point, the vast majority of the conntrack entries are unused, but still
> in hash, then you get more and more collisions, which decreases the
> hash efficiency.

What kind of collisions? Do you mean that we end up in the same hash
entry and the linked list in the entry becomes too long? There is not much
wizardry we can do about it:

- increase the hash size (i.e. buy more RAM) if the hash is small
- create a better hash function, if one can deliberately hit the same
  entry.

By the way, so far nobody has ever proved that the hash function is not
good enough.

> There's another side effect: when the system get's loaded (because of
> hash exhaustion or hash collisions), it can't process all packets arriving
> which means that conntrack will not see some FIN or RST packets allowing
> it to recover... This is a kind of 'vicious circle', or point of failure.

This is not true. If those FIN/RST packets belong to already existing
connections, then those are in the conntrack hash and the data can be
updated. If those packets do not belong to an existing connection, then
either they can create a new entry and we are fine, or conntrack is full
and the packets will be dropped - we are fine again.

> In my opinion, a first step should be to reconsider timeout values but
> also timer mechanisms.

As Patrick already wrote: there is still no proof that the timeout values
are wrong.

> > I'm against in changing the *default* timeout values, except when it is
> > based on real-life, well established cases.
>
> What sounds the most significant: 'TCP timeouts' or 'application timeouts'?
> Should (i.e) HTTP, FTP and Telnet have the same lifetime in hash?

Sorry, I have the impression that you do not know how conntrack works, or
how conntrack entries are created, updated and destroyed. Applications get
the same timeouts, but their lifetime (and even that of the different
connections of the same application) can be quite different.

> > The book-keeping overhead is at least doubled compared to the
> > conntrack-only case - this explains pretty well the results you got.
>
> what do you mean by 'book-keeping' ?
> Does NAT do a lookup even if there are no rules?

I have to write it again: even if there are no rules at all, NULL
mapping happens and new connections must be put into both nat hashes.

> > conntrack and nat are subsystems. If somebody loads them in, then they
> > start to work.
>
> work on what, since NAT has nothing to translate?

See above.

> > But why would anyone type in "iptables -t nat -L" when in reality he/she
> > does not use nat and the nat table itself??
>
> (why do we live if it's for dying in the end?)

If somebody wants to shoot himself in the foot, we can give him even more
rope :-).

> > I think because only the source port varies, this is an important issue in
> > your setup. You actually tested the hash functions and could bomb some
> > hash entries. The overall effect was a DoS against conntrack.

> This was my test setup, but since I haven't verified the conntrack hash
> distribution, I didn't want to argue on that. To measure that, we should
> maintain hash counters such as max collisions, average collisions per
> key, hit/miss depth average, number of hit/miss per second, etc...
> I've planned to do that along with profiling, but unfortunately not in
> the 2 coming weeks.

In my opinion, this is the real question. But I repeat again, nobody has
proved that the hash function is not good enough. It's only speculation.

Regards,
Jozsef

* Re: performance issues (nat / conntrack)
From: Balazs Scheidler @ 2002-06-25 15:13 UTC
To: Jozsef Kadlecsik; +Cc: Jean-Michel Hemstedt, netfilter-devel

On Tue, Jun 25, 2002 at 04:17:54PM +0200, Jozsef Kadlecsik wrote:
> On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote:
> > > The book-keeping overhead is at least doubled compared to the
> > > conntrack-only case - this explains pretty well the results you got.
> >
> > what do you mean by 'book-keeping' ?
> > Does NAT do a lookup even if there are no rules?
>
> I have to write again: even if there are no any rules, NULL
> mapping happens and new connections must be put into both nat hashes.

This should not explain the performance degradation others found. If no
rules are found in the table, the conntrack entry is added to the NAT
hashes (the place_in_hashes() function). This involves adding the entry to
two linked lists (which changes two pointers per list) and then calling
do_bindings(), which does nothing (num_manips == 0) except for calling
helpers, of which there should be none if no helper modules are loaded.

Adding entries to the NAT hashes doesn't involve memory allocation (NAT info
is stored in ip_conntrack), therefore I don't see the reason for the 50%
performance decrease.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1
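
The bookkeeping described in the message above amounts to two constant-time
list insertions. The sketch below models that step in user-space C; the list
type imitates the style of the kernel's struct list_head, and all names
(list_node, model_nat_info, model_place_in_hashes) are hypothetical stand-ins
rather than the kernel's own definitions.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of the kernel's
 * struct list_head. A stand-in for illustration only. */
struct list_node {
    struct list_node *next, *prev;
};

void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert entry right after head: constant time, no allocation. */
void list_add_head(struct list_node *entry, struct list_node *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Hypothetical NAT bookkeeping embedded in a conntrack entry, so that
 * linking it needs no extra memory. */
struct model_nat_info {
    struct list_node by_source;
    struct list_node by_proto;
};

/* Model of the null-binding case: link the entry into one bucket of
 * each of the two NAT hashes. */
void model_place_in_hashes(struct model_nat_info *info,
                           struct list_node *source_bucket,
                           struct list_node *proto_bucket)
{
    list_add_head(&info->by_source, source_bucket);
    list_add_head(&info->by_proto, proto_bucket);
}

/* Number of entries chained on a bucket. */
size_t list_length(const struct list_node *head)
{
    const struct list_node *p;
    size_t n = 0;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

The pointer writes themselves are cheap; in a kernel they would be done under
a lock shared by all CPUs, which this single-threaded model does not capture.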
* Re: performance issues (nat / conntrack) 2002-06-25 15:13 ` Balazs Scheidler @ 2002-06-25 19:06 ` Harald Welte 2002-06-26 8:18 ` Balazs Scheidler 0 siblings, 1 reply; 33+ messages in thread From: Harald Welte @ 2002-06-25 19:06 UTC (permalink / raw) To: Balazs Scheidler; +Cc: Jozsef Kadlecsik, Jean-Michel Hemstedt, netfilter-devel On Tue, Jun 25, 2002 at 05:13:02PM +0200, Balazs Scheidler wrote: > On Tue, Jun 25, 2002 at 04:17:54PM +0200, Jozsef Kadlecsik wrote: > > On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote: > > > > The book-keeping overhead is at least doubled compared to the > > > > conntrack-only case - this explains pretty well the results you got. > > > > > > what do you mean by 'book-keeping' ? > > > Does NAT do a lookup even if there are no rules? > > > > I have to write again: even if there are no any rules, NULL > > mapping happens and new connections must be put into both nat hashes. > > This should not explain the performance degradation others found. If no > rules are found in the table, the conntrack entry is added to the NAT > hashes. (place_in_hashes() function), this involves adding the entry to two > linked lists (changes two pointers per list), and then calling do_bindings() > which does nothing (num_manips == 0) except for calling helpers, which > should be none, if helper modules are not loaded. > > Adding entries to the NAT hashes doesn't involve memory allocation (NAT info > is stored in ip_conntrack), therefore I don't see the reason for the 50% > performance decrease. think about the lock contention on SMP system. The 'null binding' approach for nat (and for example, that nat helpers are called for connections with 'null binding') is a poor design. I've recently did some testing which try to avoid the null binding, but as I'm not entirely sure they don't break something else I haven't been releasing them yet. 
> Bazsi -- Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ ============================================================================ GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack) 2002-06-25 19:06 ` Harald Welte @ 2002-06-26 8:18 ` Balazs Scheidler 0 siblings, 0 replies; 33+ messages in thread From: Balazs Scheidler @ 2002-06-26 8:18 UTC (permalink / raw) To: Harald Welte, Jozsef Kadlecsik, Jean-Michel Hemstedt, netfilter-devel On Tue, Jun 25, 2002 at 09:06:47PM +0200, Harald Welte wrote: > On Tue, Jun 25, 2002 at 05:13:02PM +0200, Balazs Scheidler wrote: > > On Tue, Jun 25, 2002 at 04:17:54PM +0200, Jozsef Kadlecsik wrote: > > > On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote: > > > > what do you mean by 'book-keeping' ? > > > > Does NAT do a lookup even if there are no rules? > > > > > > I have to write again: even if there are no any rules, NULL > > > mapping happens and new connections must be put into both nat hashes. > > > > This should not explain the performance degradation others found. If no > > rules are found in the table, the conntrack entry is added to the NAT > > hashes. (place_in_hashes() function), this involves adding the entry to two > > linked lists (changes two pointers per list), and then calling do_bindings() > > which does nothing (num_manips == 0) except for calling helpers, which > > should be none, if helper modules are not loaded. > > > > Adding entries to the NAT hashes doesn't involve memory allocation (NAT info > > is stored in ip_conntrack), therefore I don't see the reason for the 50% > > performance decrease. > > think about the lock contention on SMP system. The 'null binding' > approach for nat (and for example, that nat helpers are called for > connections with 'null binding') is a poor design. > > I've recently did some testing which try to avoid the null binding, but > as I'm not entirely sure they don't break something else I haven't been > releasing them yet. The original test machine used to gather performance information was not SMP: " here is my test bed: tested target: -kernel 2.4.18 + non_local_bind + small conntrack timeouts... 
-PIII~500MHz, RAM=256MB -2*100Mb/s NIC " -- Bazsi PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1 ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: performance issues (nat / conntrack) 2002-06-25 14:17 ` Jozsef Kadlecsik 2002-06-25 15:13 ` Balazs Scheidler @ 2002-06-27 2:21 ` Andrew Smith 2002-06-27 11:24 ` Harald Welte 1 sibling, 1 reply; 33+ messages in thread From: Andrew Smith @ 2002-06-27 2:21 UTC (permalink / raw) To: netfilter-devel > On Tue, 25 Jun 2002, Jean-Michel Hemstedt wrote: > >> > connections. As good as possible. If the conntrack table becomes >> > full, there are two possibilities: >> > >> > - conntrack table size is underestimated for the real traffic >> > flowing >> > trough. Get more RAM and increase the table size. >> > - conntrack is under a (DoS) attack. Then protect conntrack by >> > appropriate >> > rules using the recent/limit/psd etc modules. >> >> And what if, under load conditions, your table becomes full because >> 90% of its entries, which are unused, are not aged because of >> timeouts? > > The only case when that might happen is a DoS. You did not consider the > second point above. <snip> I've mentioned this before but since I'm not an actual developer in the netfilter arena I assume it got ignored (and will again) but I can suggest what appears to me to be a common cause of this problem - online gaming. The specific game that causes this the most is a game called CounterStrike. It is a mod of a game called Half-Life which is handled online by Sierra. When you want to play online your computer will talk to one of the Sierra servers (there is 3 of them I think) that controls any known games that are created via the same process and the Sierra server will reply with a list of IP addresses of online game servers - anywhere from about 5,000 to 20,000 during peak times (my guess at an average would be around 10,000) Your PC will then usually 'ping' each of the game servers (yes all X thousand of them) as quickly as possible to determine the response times you will get if you play on that server. 
This 'ping' connection does end up in the conntrack table (I call it a 'ping' coz I've never bothered to check what it really is and it doesn't matter anyway; all that matters is that it ends up in the conntrack table).

There are plenty of other similar games, but CounterStrike is the most popular and thus its numbers are larger than those of any other game; most of the others are only an order of magnitude smaller, e.g. Quake I, II & III, Tribes 2, Medal Of Honour etc. The number of players online is usually between 5 and 10 times the number of game servers running.

This is a good example of a case where being able to set the timeout depending on specific factors (e.g. port/protocol) would be better than a global timeout that suits some cases, does not match many others, and causes a severe problem for a limited set of cases.

--
-Cheers
-Andrew

MS ... if only he hadn't been hang gliding!
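A rough way to put numbers on this is Little's law: the average number of entries in the table equals the rate at which entries are created times the time each one stays. The sketch below assumes each probe creates one entry that lives for a fixed timeout and is never removed early; the function names are hypothetical, and the timeout values passed in are illustrative inputs, not claims about the kernel defaults.

```python
def steady_state_entries(new_conns_per_sec, timeout_sec):
    """Little's law: average occupancy = arrival rate x time each entry stays."""
    return new_conns_per_sec * timeout_sec


def burst_peak_entries(burst_size, burst_duration_sec, timeout_sec):
    """Peak number of entries left behind by one evenly spread burst.

    If the burst is shorter than the timeout, every entry is still present
    when the burst ends. Otherwise entries expire while the burst is still
    running, and the peak is rate x timeout.
    """
    if burst_duration_sec <= timeout_sec:
        return burst_size
    return burst_size * timeout_sec / burst_duration_sec


# One client probing 10,000 game servers within 20 s, entries kept for 30 s.
peak_one_client = burst_peak_entries(10_000, 20, 30)

# A gateway seeing 1000 new connections/s with entries kept for 60 s.
gateway_entries = steady_state_entries(1000, 60)
```

Under these assumptions a single client leaves 10,000 entries behind at the end of its scan, and the number scales linearly with both the timeout and the number of clients behind the gateway.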
* Re: performance issues (nat / conntrack) 2002-06-27 2:21 ` Andrew Smith @ 2002-06-27 11:24 ` Harald Welte 2002-06-29 5:25 ` Andrew Smith 0 siblings, 1 reply; 33+ messages in thread From: Harald Welte @ 2002-06-27 11:24 UTC (permalink / raw) To: Andrew Smith; +Cc: netfilter-devel

On Thu, Jun 27, 2002 at 12:21:45PM +1000, Andrew Smith wrote:
> This gives a good example when being able to set the timeout dependant
> upon specific factors (e.g. port/protocol) would be good rather than a
> global timeout that suits specific cases and does not match many cases
> - and causes a severe problem for a limited set of cases

Sorry, but we've had this discussion over and over again. Go to the list archives and look for tuneable timeouts.

The conclusion of that discussion was that we need to cope with all cases without any tuning being necessary.

btw: For the 'ping' case, the icmp echo reply closes the connection anyway.

conntrack is mostly about tracking layer 3+4 protocol state. This should happen as transparently as possible, so no assumptions about the application are made. [conntrack helpers are an exception, and be sure I would be much happier if we didn't need to have them].

> -Cheers
> -Andrew

--
Live long and prosper
- Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/
============================================================================
GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*)
* Re: performance issues (nat / conntrack) 2002-06-27 11:24 ` Harald Welte @ 2002-06-29 5:25 ` Andrew Smith 0 siblings, 0 replies; 33+ messages in thread From: Andrew Smith @ 2002-06-29 5:25 UTC (permalink / raw) To: netfilter-devel

> On Thu, Jun 27, 2002 at 12:21:45PM +1000, Andrew Smith wrote:
>> This gives a good example when being able to set the timeout dependant
>> upon specific factors (e.g. port/protocol) would be good rather than a
>> global timeout that suits specific cases and does not match many cases
>> - and causes a severe problem for a limited set of cases
>
> Sorry, but we've had this discussion over and over again. Go to the
> list archives and look for tuneable timeouts.
>
> The conclusion of this discussion was, that we need to cope with all
> cases without any tuning being necessarry.

Well, either there is a language mistake or that statement is rubbish. It does NOT cope with all cases. It fails dismally with the case I've given. It is not POSSIBLE to cope with all cases without any tuning being necessary unless the code tunes itself. Pity that the conclusion is flawed.

> btw: For the 'ping' case, the icmp echo reply is closing the connection
> anyway.

So I guess I need to look in detail at what is happening in my case, but at a guess the problem might be that a large number of the connections fail to get a fast enough response and thus do not get closed for a 'long' time.

> conntrack is mostly about tracking layer 3+4 protocol state. And this
> should happen as transparent as possible, so assumptions about the
> application are made. [conntrack helpers are an exemption, and be sure
> I would be much happier if we didn't need to have them].
> - Harald Welte / laforge@gnumonks.org

Yes, but the problem is that it causes problems at a higher protocol level, and though it works for most cases, it fails on at least a few specific cases.

Anyway, this argument will not get anywhere. I guess some time (in the far distant future :-) when I have the time and inclination I'll fix it myself and then just have to keep patching it every time it's updated, coz the comments certainly suggest that a patch would not be accepted here.

--
-Cheers
-Andrew

MS ... if only he hadn't been hang gliding!
* Re: performance issues (nat / conntrack) 2002-06-25 12:42 ` Jean-Michel Hemstedt ` (2 preceding siblings ...) 2002-06-25 14:17 ` Jozsef Kadlecsik @ 2002-06-25 19:01 ` Harald Welte 2002-06-25 20:53 ` conntrack DoS Henrik Nordstrom 3 siblings, 1 reply; 33+ messages in thread From: Harald Welte @ 2002-06-25 19:01 UTC (permalink / raw) To: Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel On Tue, Jun 25, 2002 at 02:42:36PM +0200, Jean-Michel Hemstedt wrote: > There's another side effect: when the system get's loaded (because of > hash exhaustion or hash collisions), it can't process all packets arriving > which means that conntrack will not see some FIN or RST packets allowing > it to recover... This is a kind of 'vicious circle', or point of failure. if conntrack doesn't see a FIN or RST packet, it won't be forwarded by the machine and thus never arrive at the receiver. The sender will thus retransmit, and hope the packet makes it next time. > In my opinion, a first step should be to reconsider timeout values but > also timer mechanisms. no, the timeout values are reasonable. > > I'm against in changing the *default* timeout values, except when it is > > based on real-life, well established cases. > > What sounds the most significant: 'TCP timeouts' or 'application timeouts'? > Should (i.e) HTTP, FTP and Telnet have the same lifetime in hash? yes, they should. They are TCP connections. We shouldn't impose any application-protocol specific layer4 timeouts, that sounds horrible. the port versus application protocol (i.e. 80 == http) are by convention, not by protocol design. > But unfortunately it doesn't meet my 'timeout per protocol' needs. well, so go ahead and implement it. nobody prevents you from doing that. > indeed, this dimensioning is quite conservative, and it assumes that > conntrack is distributed on src+dst+proto, not on ports. 
But we can > live with that, since it's only a memory overhead (except if we start > considering memory pages swapping). kernel memory is never swapped out. > > conntrack and nat are subsystems. If somebody loads them in, then they > > start to work. > > work on what, since NAT has nothing to translate? they start the work necessary to be prepared to nat packets/connections. > > But why would anyone type in "iptables -t nat -L" when in reality he/she > > does not use nat and the nat table itself?? > > (why do we live if it's for dying in the end?) I don't know what kind of weird position you are claiming. I think it is now clear that you have a different perspective on how conntrack/nat should work. If the netfilter people respond to this as 'this is by design and not a bug', you will have to live with that or implement a different system. That's something different from improving load under DoS situations or improving conntrack performance in general, where we have the same goal. > This was my test setup, but since I haven't verified the conntrack hash > distribution, I didn't want to argue on that. To measure that, we should > maintain hash counters such as max collisions, average collisions per > key, hit/miss depth average, number of hit/miss per second, etc... > I've planned to do that along with profiling, but unfortunately not in > the 2 coming weeks. this sounds very constructive and we're looking forward to the results. > last points I wanted to clarify: > > 2) My test was artificial, but not unrealistic: one endpoint sustaining > 1000 conn/s wathever the responsiveness of the target, or 10000 users > trying to connect through the gw in a time lapse of 10 seconds is > similar. > Now, if some of you are telling me that I'm not allowed, or that I'm nuts > to place my box in front of 10000 users, that's another debate. > I'm not talking about dimensioning, I'm talking about relative > performances, and strange weaknesses. 
conntrack should definitely be able to handle this case and I'm looking forward to see detailed results. I'm away from my testing equipment for almost three weeks, so I cannot really reproduce or try to verify any of your claims, neither reject them. It should at least deal with 10kconn/s > kr, > -jmhe- -- Live long and prosper - Harald Welte / laforge@gnumonks.org http://www.gnumonks.org/ ============================================================================ GCS/E/IT d- s-: a-- C+++ UL++++$ P+++ L++++$ E--- W- N++ o? K- w--- O- M- V-- PS+ PE-- Y+ PGP++ t++ 5-- !X !R tv-- b+++ DI? !D G+ e* h+ r% y+(*) ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: conntrack DoS 2002-06-25 19:01 ` Harald Welte @ 2002-06-25 20:53 ` Henrik Nordstrom 2002-06-25 21:47 ` Jozsef Kadlecsik 0 siblings, 1 reply; 33+ messages in thread From: Henrik Nordstrom @ 2002-06-25 20:53 UTC (permalink / raw) To: Harald Welte, Jean-Michel Hemstedt; +Cc: Jozsef Kadlecsik, netfilter-devel

Harald Welte wrote:
> if conntrack doesn't see a FIN or RST packet, it won't be forwarded by
> the machine and thus never arrive at the receiver. The sender will thus
> retransmit, and hope the packet makes it next time.

FIN will be retransmitted a couple of times, but RST won't. RST will only be retransmitted indirectly if data arrives from the other end.

This brings me back to the important cases where TCP closes down without conntrack noticing. I have tried to bring this up a couple of times before, but I do not recall seeing much of a response.

There exist at least two real-life scenarios where conntrack won't notice that a TCP connection is gone:

 - When the RST is lost on an aborted connection where the other endpoint has disappeared.

 - When the TCP connection is dropped due to retransmission timeouts after one of the endpoints has disappeared.

The two cases above are only slight variations of the same scenario: a TCP connection is established, then one of the endpoints disappears from the network, preventing the connection from being shut down in a normal manner.

The first is very uncommon and probably not easily exploitable.

The second case is a real problem and can quite easily be used to DoS conntrack with a relatively small number of packets (about 20 minimum-size TCP packets in total per wasted conntrack entry).

The good news is that the second case can be detected by detecting a long-term uni-directional packet flow, and we probably do not need to care about the first.

A simple test case illustrating the second case: have three machines

  A <-> conntrack server <-> B

On A, set up the following two simple rules to simulate connection dropout:

  -A OUTPUT -p tcp --tcp-flags RST RST -j DROP
  -A OUTPUT -p tcp --tcp-flags FIN FIN -j DROP

On B, enable the chargen TCP service to have a simple TCP data source. To accelerate the test, conserve resources on B, and illustrate the point more obviously, lower tcp_retries2 to something like 5.

Then run the following silly test program on A:

  while true; do telnet B chargen </dev/null >/dev/null; done

Note: This is only a superficial twist on the traditional TCP connection flood DoS. The attacker finishes the SYN handshake and then ignores the connection. It can be launched on conntrack via any TCP service that returns data, causing a TCP data queue on the connection that prevents a FIN from being sent when the server aborts the connection.

> yes, they should. They are TCP connections. We shouldn't impose
> any application-protocol specific layer4 timeouts, that sounds horrible.
>
> the port versus application protocol (i.e. 80 == http) are by
> convention, not by protocol design.

I agree. It should also be noted that there exist fully valid HTTP applications utilizing very long idle periods on an HTTP connection. How long the HTTP connection is kept open is a matter between the user-agent and the server only, nobody else. The fact that most HTTP connections are short-lived does not mean that all are, or that it is OK to drop idle HTTP connections.

Regards
Henrik
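For sizing a conntrack table against this scenario, the "about 20 minimum-size TCP packets per wasted entry" figure can be turned into a back-of-the-envelope budget. This is only a sketch under stated assumptions: the table size passed in is a hypothetical example, `cost_to_fill` is a made-up helper name, and 40 bytes is the size of an IPv4 plus TCP header pair without options.

```python
MIN_TCP_PACKET_BYTES = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def cost_to_fill(table_entries, packets_per_entry=20):
    """Return (packets, bytes) needed to occupy table_entries conntrack entries.

    packets_per_entry defaults to the rough figure quoted in the message
    above; it counts all packets of the connection, not just one side's.
    """
    packets = table_entries * packets_per_entry
    return packets, packets * MIN_TCP_PACKET_BYTES


# Hypothetical table of 65,536 entries.
packets, total_bytes = cost_to_fill(65_536)
```

For the hypothetical 65,536-entry table this comes to about 1.3 million packets and roughly 50 MiB of traffic in total, which is why the traffic volume alone is not much of a barrier and timeout handling for dead connections matters.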
* Re: conntrack DoS 2002-06-25 20:53 ` conntrack DoS Henrik Nordstrom @ 2002-06-25 21:47 ` Jozsef Kadlecsik 2002-06-25 22:42 ` Henrik Nordstrom 0 siblings, 1 reply; 33+ messages in thread From: Jozsef Kadlecsik @ 2002-06-25 21:47 UTC (permalink / raw) To: Henrik Nordstrom; +Cc: Harald Welte, Jean-Michel Hemstedt, netfilter-devel On Tue, 25 Jun 2002, Henrik Nordstrom wrote: > > if conntrack doesn't see a FIN or RST packet, it won't be forwarded by > > the machine and thus never arrive at the receiver. The sender will thus > > retransmit, and hope the packet makes it next time. > > FIN will be retransmitted a couple of times, but RST won't. RST will only be > retransmitted indirectly if data arrives from the other end. Yes. > This brings me back to the important cases where TCP closes down without > conntrack noticing. Have tried to bring this up a couple of times before but > I do not recall seeing much of a response.. > > There exists at least two real-life cenarios where conntrack won't notice that > a TCP connection is gone: > > If the RST is lost on a aborted connection where the other endpoint has > disappeared. > > when the TCP is dropped due to retransmission timeouts after one of the > endpoints have disappeared. > > The two cases above is only slight variations of the same scenario. A TCP is > established, then one of the endpoints disappears from the network preventing > the connection to be shut down in a normal manner. > > The first very uncommon and probably not easily exploitable. > > The second case is a real problem and can quite easily be used to DoS > conntrack with a relatively small amount of packets. (total ~20 minimum size > TCP packets per wasted conntrack entry) > > The good news is that second case can be detected by detecting a long term > uni-directional packet flow, and we probably do not need to care about the > first.. How could be such connections (second case) sorted out from legitimate uni-directional (even half-closed) connections? 
Regards, Jozsef - E-mail : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu WWW-Home: http://www.kfki.hu/~kadlec Address : KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: conntrack DoS 2002-06-25 21:47 ` Jozsef Kadlecsik @ 2002-06-25 22:42 ` Henrik Nordstrom 2002-06-27 8:26 ` Jozsef Kadlecsik 0 siblings, 1 reply; 33+ messages in thread From: Henrik Nordstrom @ 2002-06-25 22:42 UTC (permalink / raw) To: Jozsef Kadlecsik; +Cc: netfilter-devel

Jozsef Kadlecsik wrote:
> > The good news is that second case can be detected by detecting a long term
> > uni-directional packet flow, and we probably do not need to care about the
> > first..
>
> How could be such connections (second case) sorted out from
> legitimate uni-directional (even half-closed) connections?

A running TCP packet flow (even for a "half-closed" uni-directional TCP connection) is never uni-directional. If there is data flowing in one direction, then there are ACKs in the other direction.

An idea for how conntrack could deal with such connections: if several retransmissions (let's say 5) are seen in one direction and no ACKs in the other within a reasonable timeframe (let's say 10 minutes), then the TCP connection is most likely dead, and a low inactivity timeout (let's say 20 minutes) can be assigned to have it cleaned out of conntrack.

At first glance this can be simplified into a RETRANSMIT/ACK timeout state machine, but there is a significant race window that makes a simple packet-driven state machine unsuitable. It must not trigger on a delayed retransmission followed by a lost ACK, or on delayed retransmissions that do not result in an ACK (out of window).

Regards
Henrik
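The retransmission/ACK idea above can be sketched as a small per-connection tracker. This is a naive illustration, not conntrack code: the class and field names are hypothetical, sequence-number wraparound and TCP window checks are ignored, `NORMAL_TIMEOUT_SEC` is a placeholder value, and it is precisely the simple packet-driven machine that the message warns about, so it does not handle the delayed-retransmission and lost-ACK races.

```python
from dataclasses import dataclass, field

RETRANS_THRESHOLD = 5                # "let's say 5" retransmissions
WINDOW_SEC = 10 * 60                 # "let's say 10 minutes"
DEAD_TIMEOUT_SEC = 20 * 60           # "let's say 20 minutes"
NORMAL_TIMEOUT_SEC = 5 * 24 * 3600   # placeholder for the regular timeout


@dataclass
class Direction:
    highest_end: int = -1    # highest seq + len seen from this sender
    highest_ack: int = -1    # highest ACK number sent by this sender
    retrans_times: list = field(default_factory=list)


@dataclass
class Flow:
    dirs: tuple = field(default_factory=lambda: (Direction(), Direction()))
    timeout: int = NORMAL_TIMEOUT_SEC

    def on_packet(self, now, d, seq, length, ack):
        """Update state for a packet sent in direction d (0 or 1)."""
        me, peer = self.dirs[d], self.dirs[1 - d]
        # An advancing ACK proves this side still receives the peer's data.
        if ack > me.highest_ack:
            me.highest_ack = ack
            peer.retrans_times.clear()
            self.timeout = NORMAL_TIMEOUT_SEC
        if length > 0:
            end = seq + length
            if end <= me.highest_end:
                # Data already seen: count it as a retransmission and keep
                # only the ones that fall inside the observation window.
                me.retrans_times.append(now)
                me.retrans_times[:] = [
                    t for t in me.retrans_times if now - t <= WINDOW_SEC
                ]
                if len(me.retrans_times) >= RETRANS_THRESHOLD:
                    self.timeout = DEAD_TIMEOUT_SEC
            else:
                me.highest_end = end
        return self.timeout
```

A short usage example: side 1 sends 100 bytes, retransmits them five times without any ACK from side 0, and the flow drops to the short timeout; a later ACK from side 0 restores the normal one.

```python
flow = Flow()
flow.on_packet(0, 1, 1000, 100, 1)
for t in (1, 3, 7, 15, 31):
    flow.on_packet(t, 1, 1000, 100, 1)
short = flow.timeout                      # DEAD_TIMEOUT_SEC at this point
flow.on_packet(40, 0, 1, 0, 1100)         # peer finally ACKs the data
```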
* Re: conntrack DoS 2002-06-25 22:42 ` Henrik Nordstrom @ 2002-06-27 8:26 ` Jozsef Kadlecsik 0 siblings, 0 replies; 33+ messages in thread From: Jozsef Kadlecsik @ 2002-06-27 8:26 UTC (permalink / raw) To: Henrik Nordstrom; +Cc: netfilter-devel On Wed, 26 Jun 2002, Henrik Nordstrom wrote: > A running TCP packet flow (even for a "half-closed" uni-directional TCP) > is never uni-directional. If there is data in flowing in one direction > then there is ACKs in the other direction. Yes, right. > Idea on how conntrack could deal with such connections: If several > retransmissions (lets say 5) is seen in one direction and no ACKs in the > other within a reasonable timeframe (lets say 10 minutes) then the TCP > is most likely dead and a low inactivity timeout can be assigned (lets > say 20 minutes) to have it cleaned out from conntrack. > > At a first glance this can be simplified into a RETRANSMIT/ACK timeout > state machinery, but there is a significant race window making a simple > packet driven state machine unsuitable. Must not trigger on a delayed > retransmission followed by a lost ACK, or delayed retransmissions not > resulting in ACK (out of window). I believe it is a good approach and can be implemented. But first the NOTRACK patch... Regards, Jozsef - E-mail : kadlec@blackhole.kfki.hu, kadlec@sunserv.kfki.hu WWW-Home: http://www.kfki.hu/~kadlec Address : KFKI Research Institute for Particle and Nuclear Physics H-1525 Budapest 114, POB. 49, Hungary ^ permalink raw reply [flat|nested] 33+ messages in thread
Thread overview: 33+ messages
2002-06-20 19:48 performance issues (nat / conntrack) Jean-Michel Hemstedt
2002-06-22 16:51 ` Harald Welte
2002-06-23 9:15 ` Jean-Michel Hemstedt
2002-06-25 11:33 ` Jozsef Kadlecsik
2002-06-25 12:47 ` Harald Welte
2002-06-25 14:23 ` Jozsef Kadlecsik
[not found] ` <025001c21c50$763fa880$0489cb8a@etbx180>
2002-06-25 16:07 ` Harald Welte
2002-06-25 21:08 ` Jozsef Kadlecsik
2002-06-25 13:21 ` Jean-Michel Hemstedt
2002-06-25 13:51 ` Harald Welte
2002-06-25 14:33 ` Jozsef Kadlecsik
2002-06-25 14:51 ` Jean-Michel Hemstedt
2002-06-25 16:11 ` Harald Welte
2002-06-25 13:52 ` Patrick Schaaf
2002-06-25 14:53 ` Jozsef Kadlecsik
2002-06-25 15:22 ` Balazs Scheidler
2002-06-25 10:35 ` Jozsef Kadlecsik
2002-06-25 12:42 ` Jean-Michel Hemstedt
2002-06-25 13:50 ` Patrick Schaaf
2002-06-25 19:03 ` Harald Welte
2002-06-25 13:56 ` Alex Bennee
2002-06-25 14:17 ` Jozsef Kadlecsik
2002-06-25 15:13 ` Balazs Scheidler
2002-06-25 19:06 ` Harald Welte
2002-06-26 8:18 ` Balazs Scheidler
2002-06-27 2:21 ` Andrew Smith
2002-06-27 11:24 ` Harald Welte
2002-06-29 5:25 ` Andrew Smith
2002-06-25 19:01 ` Harald Welte
2002-06-25 20:53 ` conntrack DoS Henrik Nordstrom
2002-06-25 21:47 ` Jozsef Kadlecsik
2002-06-25 22:42 ` Henrik Nordstrom
2002-06-27 8:26 ` Jozsef Kadlecsik