From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Preston A. Elder"
Subject: [Fwd: Re: [netfilter-core] iptables/conntrack in enterprise environment.]
Date: 04 Jun 2003 16:18:52 -0400
Sender: netfilter-devel-admin@lists.netfilter.org
Message-ID: <1054757932.9533.29.camel@clueless>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-bX/gbDZU9hi+uCVI7wMS"
Errors-To: netfilter-devel-admin@lists.netfilter.org
To: netfilter@lists.netfilter.org, netfilter-devel@lists.netfilter.org, coreteam@netfilter.org

--=-bX/gbDZU9hi+uCVI7wMS
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

Hi,

While waiting for a response to the email I sent below, I went ahead and
investigated the 3rd 'question' I raised in that email.

I essentially removed the checking of ip_nat_used_tuple for DNAT entries
(which applies to REDIRECT entries too).  I changed this code in
get_unique_tuple (ip_nat_core.c) from this:

    if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
         || proto->in_range(tuple, HOOK2MANIP(hooknum),
                            &rptr->min, &rptr->max))
        && !ip_nat_used_tuple(tuple, conntrack)) {
        ret = 1;
        goto clear_fulls;
    } else {

to this:

    if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
         || proto->in_range(tuple, HOOK2MANIP(hooknum),
                            &rptr->min, &rptr->max))
        && (HOOK2MANIP(hooknum) == IP_NAT_MANIP_DST ?
            1 : !ip_nat_used_tuple(tuple, conntrack))) {
        ret = 1;
        goto clear_fulls;
    } else {

I also commented out the ASSERTs just after the proto->unique_tuple calls
in get_unique_tuple (ip_nat_core.c), i.e. the lines that look like this:

    IP_NF_ASSERT(!ip_nat_used_tuple(tuple, conntrack));

And I changed this code in tcp_unique_tuple (ip_nat_proto_tcp.c) from
this:

    for (i = 0; i < range_size; i++, port++) {
        *portptr = htons(min + port % range_size);
        if (!ip_nat_used_tuple(tuple, conntrack)) {
            return 1;
        }
    }

to this:

    if (maniptype == IP_NAT_MANIP_DST) {
        *portptr = htons(min + net_random() % range_size);
        return 1;
    } else {
        start = net_random() % range_size;
        port += start;
        for (i = start; i < range_size; i++, port++) {
            *portptr = htons(min + port % range_size);
            if (!ip_nat_used_tuple(tuple, conntrack)) {
                return 1;
            }
        }
        if (i == range_size) {
            port -= range_size;
            for (i = 0; i < start; i++, port++) {
                *portptr = htons(min + port % range_size);
                if (!ip_nat_used_tuple(tuple, conntrack)) {
                    return 1;
                }
            }
        }
    }

I only have one rule in the entire NAT table, the one that forwards all
new connections to ports for machines behind it to a specific port range
on the local machine (which is in the PREROUTING 'chain').

This change DOES seem to have the desired effect of making connections
fully establish pretty much immediately, and as suspected, since the
socket on the local machine is just a listening socket, it really does
not care about multiple connections, and thus does not need the 'in use'
checking above.

However, after putting this in place, the system seems to stop
functioning (I'm not sure if it's just the network or the system itself,
since I'm not at the console; however, I suspect it's the system itself,
as it's a very sudden freeze).

Could someone shed some light on why the system would freeze after a
short period of time (less than 5 minutes) with this code running (note:
it only freezes when our application is running, i.e. there is something
there to accept connections)?
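For reference, the random-start search with wraparound used in the
non-DNAT branch above can be sketched outside the kernel as a single
loop.  This is only a user-space illustration under stated assumptions,
not the code that was running on the router: pick_port, port_in_use_fn
and demo_in_use are made-up names, the in_use callback stands in for
ip_nat_used_tuple, and the caller is assumed to supply the start offset
(e.g. rand() % range_size where the kernel code uses net_random()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for ip_nat_used_tuple(): returns true if the
 * given port is already taken. */
typedef bool (*port_in_use_fn)(uint16_t port, void *ctx);

/*
 * Scan the range [min, min + range_size) beginning at offset `start`
 * and wrapping around once, so every port is tried exactly one time.
 * Returns the first free port, or -1 if the whole range is in use.
 * Assumes start < range_size and that the range fits in 16 bits.
 */
static int pick_port(uint16_t min, unsigned int range_size,
                     unsigned int start,
                     port_in_use_fn in_use, void *ctx)
{
    unsigned int i;

    assert(range_size > 0);
    for (i = 0; i < range_size; i++) {
        uint16_t candidate = (uint16_t)(min + (start + i) % range_size);

        if (!in_use(candidate, ctx))
            return candidate;
    }
    return -1;
}

/* Example predicate: ctx is a bool array indexed by (port - 5000). */
static bool demo_in_use(uint16_t port, void *ctx)
{
    const bool *used = ctx;

    return used[port - 5000];
}
```

Folding the two loops into one modulo walk visits the same ports in the
same order as the patch above, and makes the "nothing free" case
explicit instead of falling off the end of the second loop.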
And also possibly shed some light on possible side-effects the above
modifications could have (apart from freezing the system)?

I don't usually screw around with the kernel (though I have before), so
this is relatively new territory for me.

Any and all help, comments, etc. appreciated.

Thanks,

PreZ :)

--=-bX/gbDZU9hi+uCVI7wMS
Content-Disposition: inline
Content-Description: Forwarded message - Re: [netfilter-core] iptables/conntrack in enterprise environment.
Content-Type: message/rfc822

From: "Preston A. Elder"
Organization: Shadow Realm
X-KMail-Identity: 1534576354
To: Harald Welte
Subject: Re: [netfilter-core] iptables/conntrack in enterprise environment.
Date: Fri, 30 May 2003 20:55:25 -0400
User-Agent: KMail/1.5.2
Cc: netfilter@lists.netfilter.org, netfilter-devel@lists.netfilter.org, coreteam@netfilter.org
References: <200305290113.58552.prez@srealm.net.au> <200305300133.44822.prez@srealm.net.au> <20030530194240.GI29312@sunbeam.de.gnumonks.org>
In-Reply-To: <20030530194240.GI29312@sunbeam.de.gnumonks.org>
X-KMail-Link-Message: 2030304710
X-KMail-Link-Type: reply
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Description: clearsigned data
Content-Disposition: inline
Message-Id: <200305302055.40839.prez@srealm.net.au>
X-KMail-EncryptionState: N
X-KMail-SignatureState: F
Content-Type: text/plain; CHARSET=iso-8859-1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Friday 30 May 2003 03:42 pm, Harald Welte wrote:
> > If I then telnet to 10.0.0.1 on port 5050, the connection is immediate,
> > and my application receives a new connection on port 5050.  If, however,
> > I telnet to 10.0.0.1 on port 5150, there is a small (but noticable) delay
> > between when the telnet session shows the connection as established, and
> > when the application receives the connection (on a random port between
> > 5000 and 5100 inclusive).
>
> This is _definitely_ not a netfilter issue then.  We are not doing
> transparent proxying but nat.
> netfilter/iptables _never_ accept
> connections on their own.  So you open a connection:

Maybe not, but that's the behavior I'm seeing.  But this doesn't bother
me so much as the fact that it takes a LOT longer for a connection to be
established when connecting to a port outside the range than to one
inside the range.

> 1. telnet 10.0.0.1 5150
> 2. syn packet is sent by telnet
> 3. syn packet is DNAT'ed by netfilter
> 4. syn packet arrives at server application
> 5. syn/ack packet is sent by server application
> 6. syn/ack packet is SNAT'ed by netfilter
> 7. syn/ack packet is received by telnet
> [further handshake goes on]
> x. telnet application prints 'Connection established' (connect(2) call
>    returns)
>
> This is a fundamental tcp/ip operation, and it can certainly by no way
> be anything that netfilter does, that would introduce a behaviour like
>
> 1. telnet 10.0.0.1 5150
> 2. telnet shows 'connection established'
> 3. connection 'arrives' at server
>
> [which is what you have been describing, If I understood you correctly].

I realise this is supposed to happen; however, again, it's not what I'm
observing, though it's only noticeable under high loads.  So maybe it's
taking a long time to get the initial SYN and ACK to the server from
telnet, which would account for both the delay before the connection is
established by telnet (the delay for the SYN/ACK to get back from the
server) and the delay between this and the server showing the connection
as established (which only happens after it gets the ACK).

OK, so with this information, it's taking a LONG time to get the SYN and
ACK to the server when I'm connecting to a port outside the range I'm
listening on with the application, causing the appearance of the port
being connected by telnet a long time before the server shows it being
established.  In any case, this doesn't change the problem.
This delay does *NOT* exist when connecting to a port that is in the
listening range (remember that all connections being discussed here are
going to systems BEHIND the router box).

I did some kernel diving, and I've isolated the place where the behavior
differs between a connection 'in' and 'out' of the port range, at this
if statement in get_unique_tuple (ip_nat_core.c):

    if ((!(rptr->flags & IP_NAT_RANGE_PROTO_SPECIFIED)
         || proto->in_range(tuple, HOOK2MANIP(hooknum),
                            &rptr->min, &rptr->max))
        && !ip_nat_used_tuple(tuple, conntrack)) {

Obviously, anything connecting to a port outside this range will fail
the above if.  If this fails, it will lead us to do this (one way or
another):

    if (proto->unique_tuple(tuple, rptr,
                            HOOK2MANIP(hooknum),
                            conntrack)) {

The unique_tuple function is where the delay is.  In this case, it's a
TCP connection, which translates to tcp_unique_tuple
(ip_nat_proto_tcp.c), which, after determining the port range, does the
following:

    for (i = 0; i < range_size; i++, port++) {
        *portptr = htons(min + port % range_size);
        if (!ip_nat_used_tuple(tuple, conntrack)) {
            return 1;
        }
    }

Now I'm assuming it will use this:

    min = ntohs(range->min.tcp.port);
    range_size = ntohs(range->max.tcp.port) - min + 1;

rather than this:

    min = 1024;
    range_size = 65535 - 1024 + 1;

for my port range, since I specified a local port range; either way, the
range I care about is about 100 ports.  Which means the above NAT 'used'
lookup can happen up to 100 times (as opposed to once when the port is
inside the range).  And since ip_nat_used_tuple (ip_nat_core.c) calls
ip_conntrack_tuple_taken (ip_conntrack_core.c), which subsequently
locks/unlocks the entire conntrack table and calls __ip_conntrack_find
(which is a list lookup!), the above for loop is a very intensive
process.

Now, after all that.
A few questions arise:

1) Is there a way the above for loop could lock the conntrack table
beforehand, do the searches (not locked) and unlock afterwards?  This is
probably the biggest thing consuming all the speed: constant
locking/unlocking, especially when there are many new socket connections
coming in every second.

2) Is there a way the above for loop could use a random 'start' place,
search through to the end, and then start at the beginning, until it
ends up back at the same place it started from?  That would avoid it
constantly treading the same ground (if the last entry used 1, it's
pretty assured 1 is still going to be in use when you get the next
entry, so you could either skip directly to 2, or randomly choose a new
position to start at).  (i.e. something like:

    int port, start_port = (rand() % range) + min_port;
    for (port=start_port; port