* Something hitting my total number of connections to the server
From: Akshat Kakkar @ 2017-08-14 9:07 UTC
To: netdev

I have CentOS 7.3 (kernel 3.10) running on a server with 128GB RAM and
2 x 10 Core Xeon Processor.
I have hosted a webserver on it and enabled ssh for remote maintenance.
Previously it was running on CentOS 6.3.
After upgrading to CentOS 7.3, occasionally (probably when the number of
hits on the server is higher), I am not able to create new connections
(neither web nor ssh). Existing connections keep running fine.

I captured packets using tcpdump to understand whether it is some
intermediate network issue.
What I found was that the server is not replying to new SYN requests.

So it's clear that it is not an application issue. Also, the
application logs show no dropped connections.

I checked my firewall rules for any rate limiting.
There is nothing in there.

I checked tc, in case some rate limiting was imposed by mistake. There is
nothing in there either.

I have increased noOfFiles to 1000000 and other sysctl parameters, but
the issue is still there.

Has anybody experienced the same?

How to go about it? Anybody ... Please help!
* Re: Something hitting my total number of connections to the server 2017-08-14 9:07 Something hitting my total number of connections to the server Akshat Kakkar @ 2017-08-16 4:48 ` Akshat Kakkar 2017-08-16 10:34 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Akshat Kakkar @ 2017-08-16 4:48 UTC (permalink / raw) To: netdev On Mon, Aug 14, 2017 at 2:37 PM, Akshat Kakkar <akshat.1984@gmail.com> wrote: > I have centos 7.3 (Kernel 3.10) running on a server with 128GB RAM and > 2 x 10 Core Xeon Processor. > I have hosted a webserver on it and enabled ssh for remote maintenance. > Previously it was running on Centos 6.3. > After upgrading to CentOS 7.3, occasionally (probably when number of > hits are more on the server), I am not able to create new connections > (neither on web nor on ssh). Existing connections keeps on running > fine. > > I did packet capturing using tcpdump to understand if its some > intermediate network issue. > What I found was the server is not replying for new SYN requests. > > So it's clear that its not at all application issue. Also, there are > no logs in applications logs for any connections dropped, if any. > > I check my firewall rules if there is some rate limiting imposed. > There is nothing in there. > > I check tc, if by mistake some rate limiting is imposed. There is > nothing in there too. > > I have increased noOfFiles to 1000000 and other sysctl parameters, but > the issue is still there. > > Has anybody experienced the same? > > How to go about? Anybody ... Please Help!!! Its getting lonely out here. Anybody there ??? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server 2017-08-16 4:48 ` Akshat Kakkar @ 2017-08-16 10:34 ` Eric Dumazet 2017-08-17 9:05 ` Akshat Kakkar 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2017-08-16 10:34 UTC (permalink / raw) To: Akshat Kakkar; +Cc: netdev On Wed, 2017-08-16 at 10:18 +0530, Akshat Kakkar wrote: > On Mon, Aug 14, 2017 at 2:37 PM, Akshat Kakkar <akshat.1984@gmail.com> wrote: > > I have centos 7.3 (Kernel 3.10) running on a server with 128GB RAM and > > 2 x 10 Core Xeon Processor. > > I have hosted a webserver on it and enabled ssh for remote maintenance. > > Previously it was running on Centos 6.3. > > After upgrading to CentOS 7.3, occasionally (probably when number of > > hits are more on the server), I am not able to create new connections > > (neither on web nor on ssh). Existing connections keeps on running > > fine. > > > > I did packet capturing using tcpdump to understand if its some > > intermediate network issue. > > What I found was the server is not replying for new SYN requests. > > > > So it's clear that its not at all application issue. Also, there are > > no logs in applications logs for any connections dropped, if any. > > > > I check my firewall rules if there is some rate limiting imposed. > > There is nothing in there. > > > > I check tc, if by mistake some rate limiting is imposed. There is > > nothing in there too. > > > > I have increased noOfFiles to 1000000 and other sysctl parameters, but > > the issue is still there. > > > > Has anybody experienced the same? > > > > How to go about? Anybody ... Please Help!!! > > Its getting lonely out here. Anybody there ??? We wont help you unless you use a recent kernel. 3.10 misses all recent improvements in TCP stack (4 years of hard work) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server 2017-08-16 10:34 ` Eric Dumazet @ 2017-08-17 9:05 ` Akshat Kakkar 2017-08-17 11:36 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Akshat Kakkar @ 2017-08-17 9:05 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Wed, Aug 16, 2017 at 4:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Wed, 2017-08-16 at 10:18 +0530, Akshat Kakkar wrote: >> On Mon, Aug 14, 2017 at 2:37 PM, Akshat Kakkar <akshat.1984@gmail.com> wrote: >> > I have centos 7.3 (Kernel 3.10) running on a server with 128GB RAM and >> > 2 x 10 Core Xeon Processor. >> > I have hosted a webserver on it and enabled ssh for remote maintenance. >> > Previously it was running on Centos 6.3. >> > After upgrading to CentOS 7.3, occasionally (probably when number of >> > hits are more on the server), I am not able to create new connections >> > (neither on web nor on ssh). Existing connections keeps on running >> > fine. >> > >> > I did packet capturing using tcpdump to understand if its some >> > intermediate network issue. >> > What I found was the server is not replying for new SYN requests. >> > >> > So it's clear that its not at all application issue. Also, there are >> > no logs in applications logs for any connections dropped, if any. >> > >> > I check my firewall rules if there is some rate limiting imposed. >> > There is nothing in there. >> > >> > I check tc, if by mistake some rate limiting is imposed. There is >> > nothing in there too. >> > >> > I have increased noOfFiles to 1000000 and other sysctl parameters, but >> > the issue is still there. >> > >> > Has anybody experienced the same? >> > >> > How to go about? Anybody ... Please Help!!! >> >> Its getting lonely out here. Anybody there ??? > > We wont help you unless you use a recent kernel. > > 3.10 misses all recent improvements in TCP stack (4 years of hard work) > > > > > I upgraded to 4.4 but still experiencing same issue. Please help. 
* Re: Something hitting my total number of connections to the server 2017-08-17 9:05 ` Akshat Kakkar @ 2017-08-17 11:36 ` Eric Dumazet 2017-08-18 9:14 ` Akshat Kakkar 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2017-08-17 11:36 UTC (permalink / raw) To: Akshat Kakkar; +Cc: netdev On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote: > I upgraded to 4.4 but still experiencing same issue. > Please help. Still too old kernel, shoot again ;) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server 2017-08-17 11:36 ` Eric Dumazet @ 2017-08-18 9:14 ` Akshat Kakkar 2017-08-18 12:06 ` Eric Dumazet 2017-08-21 9:43 ` David Laight 0 siblings, 2 replies; 26+ messages in thread From: Akshat Kakkar @ 2017-08-18 9:14 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote: > >> I upgraded to 4.4 but still experiencing same issue. >> Please help. > > Still too old kernel, shoot again ;) > > Sorry but that's the maximum I can try as of now as its the LT version. Besides, this issue was not present in 2.6.32 but came with 3.10 and still there in 4.4, so I doubt if it has to do with some kernel and/or kernel parameters much as you guys are good enough not to keep an issue for so long (around 3 years). So please help. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server
From: Eric Dumazet @ 2017-08-18 12:06 UTC
To: Akshat Kakkar; +Cc: netdev

On Fri, 2017-08-18 at 14:44 +0530, Akshat Kakkar wrote:
> On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote:
> >
> >> I upgraded to 4.4 but still experiencing same issue.
> >> Please help.
> >
> > Still too old kernel, shoot again ;)
>
> Sorry but that's the maximum I can try as of now as its the LT version.
>
> Besides, this issue was not present in 2.6.32 but came with 3.10 and
> still there in 4.4, so I doubt if it has to do with some kernel and/or
> kernel parameters much as you guys are good enough not to keep an
> issue for so long (around 3 years).
>
> So please help.

netdev is the developer list.

We deal with recent kernels only. Because we already spent time fixing
all these issues, we are not going to spend time fixing old kernels.

Please ask your distro provider to backport the needed patches.
* Re: Something hitting my total number of connections to the server 2017-08-18 12:06 ` Eric Dumazet @ 2017-08-18 12:44 ` Akshat Kakkar 2017-08-18 12:57 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Akshat Kakkar @ 2017-08-18 12:44 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Fri, Aug 18, 2017 at 5:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Fri, 2017-08-18 at 14:44 +0530, Akshat Kakkar wrote: >> On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: >> > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote: >> > >> >> I upgraded to 4.4 but still experiencing same issue. >> >> Please help. >> > >> > Still too old kernel, shoot again ;) >> > >> > >> >> >> Sorry but that's the maximum I can try as of now as its the LT version. >> >> Besides, this issue was not present in 2.6.32 but came with 3.10 and >> still there in 4.4, so I doubt if it has to do with some kernel and/or >> kernel parameters much as you guys are good enough not to keep an >> issue for so long (around 3 years). >> >> So please help. > > netdev is the developer list. > > We deal with recent kernels only. Because we already spent time fixing > all these issues, we are not going to spend time fixing old kernels. > > Please to your distro provider to backport the needed patches. > > > I appreciate that. Can you just recall if there was any such issue which was fixed after 4.4. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server 2017-08-18 12:44 ` Akshat Kakkar @ 2017-08-18 12:57 ` Eric Dumazet 0 siblings, 0 replies; 26+ messages in thread From: Eric Dumazet @ 2017-08-18 12:57 UTC (permalink / raw) To: Akshat Kakkar; +Cc: netdev On Fri, 2017-08-18 at 18:14 +0530, Akshat Kakkar wrote: > On Fri, Aug 18, 2017 at 5:36 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Fri, 2017-08-18 at 14:44 +0530, Akshat Kakkar wrote: > >> On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote: > >> > > >> >> I upgraded to 4.4 but still experiencing same issue. > >> >> Please help. > >> > > >> > Still too old kernel, shoot again ;) > >> > > >> > > >> > >> > >> Sorry but that's the maximum I can try as of now as its the LT version. > >> > >> Besides, this issue was not present in 2.6.32 but came with 3.10 and > >> still there in 4.4, so I doubt if it has to do with some kernel and/or > >> kernel parameters much as you guys are good enough not to keep an > >> issue for so long (around 3 years). > >> > >> So please help. > > > > netdev is the developer list. > > > > We deal with recent kernels only. Because we already spent time fixing > > all these issues, we are not going to spend time fixing old kernels. > > > > Please to your distro provider to backport the needed patches. > > > > > > > I appreciate that. > Can you just recall if there was any such issue which was fixed after 4.4. More than one hundred patches yes. Sorry, someone else than me will have to build a list of these patches. ^ permalink raw reply [flat|nested] 26+ messages in thread
* RE: Something hitting my total number of connections to the server 2017-08-18 9:14 ` Akshat Kakkar 2017-08-18 12:06 ` Eric Dumazet @ 2017-08-21 9:43 ` David Laight 2017-08-21 9:56 ` Akshat Kakkar 1 sibling, 1 reply; 26+ messages in thread From: David Laight @ 2017-08-21 9:43 UTC (permalink / raw) To: 'Akshat Kakkar', Eric Dumazet; +Cc: netdev From: Akshat Kakkar > Sent: 18 August 2017 10:14 > On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote: > > > >> I upgraded to 4.4 but still experiencing same issue. > >> Please help. > > > > Still too old kernel, shoot again ;) > > > > > > > Sorry but that's the maximum I can try as of now as its the LT version. You should be able to build a current kernel and run it with your existing user space. David ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server
From: Akshat Kakkar @ 2017-08-21 9:56 UTC
To: David Laight; +Cc: Eric Dumazet, netdev

On Mon, Aug 21, 2017 at 3:13 PM, David Laight <David.Laight@aculab.com> wrote:
> From: Akshat Kakkar
>> Sent: 18 August 2017 10:14
>> On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote:
>> >
>> >> I upgraded to 4.4 but still experiencing same issue.
>> >> Please help.
>> >
>> > Still too old kernel, shoot again ;)
>>
>> Sorry but that's the maximum I can try as of now as its the LT version.
>
> You should be able to build a current kernel and run it with your
> existing user space.
>
> David

The issue is with TCP timestamps. When I disable them, things work
fine, but when I enable them the issue recurs. However, I am not
seeing TCP timestamps on the packets even when they are enabled, simply
because my client doesn't support them.

But the question is: if my client doesn't support timestamps, why
does enabling timestamps on the server side create an issue?
* Re: Something hitting my total number of connections to the server
From: Neal Cardwell @ 2017-08-21 14:17 UTC
To: Akshat Kakkar; +Cc: David Laight, Eric Dumazet, netdev

On Mon, Aug 21, 2017 at 5:56 AM, Akshat Kakkar <akshat.1984@gmail.com> wrote:
>
> The issue is with tcp timestamp. When I am disabling it, things are
> working fine but when I enable the issue re-occurs. However, I am not
> seeing tcp timestamps on packet, even when it is enabled simply
> because my client doesn't support it.
>
> But the question is, if I my client doesnt support timestamp , why
> enabling timestamp on server side is creating an issue??

To help shed light on this, you could try collecting and dumping the
nstat counters when the system is in the mode where it is not
creating/accepting new connections, e.g.:

  nstat > /dev/null && sleep 10 && nstat

The sleep interval would need to be long enough to cover a failing
client connect attempt.

It would also be helpful to gather a tcpdump trace over the interval,
to see if the server is sending a RST, SYN+ACK, or nothing.

neal
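As a rough illustration of how the output of the nstat command suggested
above could be scanned, here is a minimal Python sketch. It assumes each
nstat line has the form "<name> <value> <rate>" after a "#kernel" header.
The counter names in SUSPECT_COUNTERS are a guess at which ones are worth
a first look, and the SAMPLE text is made up for the example; none of it
comes from the thread.

```python
# Counters that may move when SYNs are dropped silently. This selection is
# an assumption for illustration, not a list given in the thread.
SUSPECT_COUNTERS = (
    "TcpExtListenDrops",
    "TcpExtListenOverflows",
    "TcpExtPAWSPassive",
)


def parse_nstat(text):
    """Parse nstat output lines of the assumed form '<name> <value> <rate>'."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the '#kernel' header line.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        try:
            counters[fields[0]] = int(fields[1])
        except ValueError:
            # Ignore lines whose second field is not an integer.
            continue
    return counters


def suspicious(counters, names=SUSPECT_COUNTERS):
    """Return the watched counters that increased during the interval."""
    return {name: counters[name] for name in names if counters.get(name, 0) > 0}


# Hypothetical sample output; the numbers are invented for this example.
SAMPLE = """#kernel
IpInReceives                    1200               0.0
TcpPassiveOpens                 3                  0.0
TcpExtPAWSPassive               57                 0.0
TcpExtListenDrops               0                  0.0
"""

if __name__ == "__main__":
    print(suspicious(parse_nstat(SAMPLE)))
```

In practice the text would come from saving the second nstat invocation to
a file during a period when new connections are failing.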
* Re: Something hitting my total number of connections to the server 2017-08-21 9:56 ` Akshat Kakkar 2017-08-21 14:17 ` Neal Cardwell @ 2017-08-21 15:17 ` Eric Dumazet 2017-08-21 17:28 ` Akshat Kakkar 1 sibling, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2017-08-21 15:17 UTC (permalink / raw) To: Akshat Kakkar; +Cc: David Laight, netdev On Mon, 2017-08-21 at 15:26 +0530, Akshat Kakkar wrote: > On Mon, Aug 21, 2017 at 3:13 PM, David Laight <David.Laight@aculab.com> wrote: > > From: Akshat Kakkar > >> Sent: 18 August 2017 10:14 > >> On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > >> > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote: > >> > > >> >> I upgraded to 4.4 but still experiencing same issue. > >> >> Please help. > >> > > >> > Still too old kernel, shoot again ;) > >> > > >> > > >> > >> > >> Sorry but that's the maximum I can try as of now as its the LT version. > > > > You should be able to build a current kernel and run it with your > > existing user space. > > > > David > > > > The issue is with tcp timestamp. When I am disabling it, things are > working fine but when I enable the issue re-occurs. However, I am not > seeing tcp timestamps on packet, even when it is enabled simply > because my client doesn't support it. > > But the question is, if I my client doesnt support timestamp , why > enabling timestamp on server side is creating an issue?? Maybe you changed some sysctls wrongly ? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server 2017-08-21 15:17 ` Eric Dumazet @ 2017-08-21 17:28 ` Akshat Kakkar 2017-08-21 17:44 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Akshat Kakkar @ 2017-08-21 17:28 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Laight, netdev On Monday, August 21, 2017, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > On Mon, 2017-08-21 at 15:26 +0530, Akshat Kakkar wrote: > > On Mon, Aug 21, 2017 at 3:13 PM, David Laight <David.Laight@aculab.com> wrote: > > > From: Akshat Kakkar > > >> Sent: 18 August 2017 10:14 > > >> On Thu, Aug 17, 2017 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > > >> > On Thu, 2017-08-17 at 14:35 +0530, Akshat Kakkar wrote: > > >> > > > >> >> I upgraded to 4.4 but still experiencing same issue. > > >> >> Please help. > > >> > > > >> > Still too old kernel, shoot again ;) > > >> > > > >> > > > >> > > >> > > >> Sorry but that's the maximum I can try as of now as its the LT version. > > > > > > You should be able to build a current kernel and run it with your > > > existing user space. > > > > > > David > > > > > > > The issue is with tcp timestamp. When I am disabling it, things are > > working fine but when I enable the issue re-occurs. However, I am not > > seeing tcp timestamps on packet, even when it is enabled simply > > because my client doesn't support it. > > > > But the question is, if I my client doesnt support timestamp , why > > enabling timestamp on server side is creating an issue?? > > Maybe you changed some sysctls wrongly ? > > As mentioned in my initial description, the server is not sending SYN-ACK. Thats what the main symptom. For completeness, its not sending any RST also. However, if I disable TCP timestamp ... the server starts giving SYN-ACK. The strangest thing is, my client doesnt initiate a connection with tcp timestamp, so how come disabling tcp timestamp is making things work. 
* Re: Something hitting my total number of connections to the server 2017-08-21 17:28 ` Akshat Kakkar @ 2017-08-21 17:44 ` Eric Dumazet 2017-08-21 17:44 ` Eric Dumazet 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2017-08-21 17:44 UTC (permalink / raw) To: Akshat Kakkar; +Cc: David Laight, netdev On Mon, 2017-08-21 at 22:58 +0530, Akshat Kakkar wrote: > As mentioned in my initial description, the server is not sending > SYN-ACK. Thats what the main symptom. For completeness, its not > sending any RST also. > However, if I disable TCP timestamp ... the server starts giving SYN-ACK. > The strangest thing is, my client doesnt initiate a connection with > tcp timestamp, so how come disabling tcp timestamp is making things > work. As I said, maybe the bug was already fixed months ago. By running an old kernel, you want us to spend time on something that might already have been fixed. Only if you run a current kernel _and_ reproduce the problem, then we might take a look. I suspect your client is a single host ? - Why is timewait not being used ? - What sysctls have been changed on your server ? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server 2017-08-21 17:44 ` Eric Dumazet @ 2017-08-21 17:44 ` Eric Dumazet 2017-08-22 5:42 ` Akshat Kakkar 0 siblings, 1 reply; 26+ messages in thread From: Eric Dumazet @ 2017-08-21 17:44 UTC (permalink / raw) To: Akshat Kakkar; +Cc: David Laight, netdev On Mon, 2017-08-21 at 10:44 -0700, Eric Dumazet wrote: > - Why is timewait not being used ? > s/timewait/timestamps/ ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server
From: Akshat Kakkar @ 2017-08-22 5:42 UTC
To: Eric Dumazet; +Cc: David Laight, netdev

There are multiple hosts/clients. All are mainly Windows based.

Timestamps are not used, as my clients are mainly Windows based and
there TCP timestamps are disabled by default.

sysctl is as follows:

kernel.shmmax = 68719476736
kernel.shmall = 4294967296
kernel.pid_max=4194303
vm.max_map_count=131072
kernel.sem=250 32000 32 250

net.netfilter.nf_conntrack_generic_timeout = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 30
net.netfilter.nf_conntrack_tcp_timeout_established = 7200
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_fin_timeout=30
net.ipv4.tcp_keepalive_time=1800
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=20
net.ipv4.tcp_max_syn_backlog=4096
net.ipv4.tcp_syncookies=1
net.ipv4.tcp_sack=1
net.ipv4.tcp_dsack=1
net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_syn_retries=3
net.ipv4.tcp_synack_retries=3
net.ipv4.tcp_retries1=3
net.ipv4.tcp_retries2=15
net.ipv4.ip_local_port_range=1024 65535

net.ipv4.tcp_timestamps=0

net.core.netdev_max_backlog=10000
net.core.somaxconn=100000
net.core.optmem_max=81920

net.netfilter.nf_conntrack_max=524288
net.nf_conntrack_max=524288
net.ipv6.conf.all.disable_ipv6 = 1
fs.file-max=1000000

net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_congestion_control=htcp

net.ipv4.tcp_rfc1337 = 1
net.core.netdev_max_backlog = 65536
net.ipv4.tcp_max_tw_buckets = 1440000

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728

On Mon, Aug 21, 2017 at 11:14 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-08-21 at 10:44 -0700, Eric Dumazet wrote:
>
>> - Why is timewait not being used ?
>>
>
> s/timewait/timestamps/
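The listing above sets several keys more than once and enables the
tcp_tw_reuse/tcp_tw_recycle pair that the replies in this thread single
out. A minimal Python sketch of a checker for sysctl.conf-style text
follows. It assumes lines are applied top to bottom, so the last
assignment of a key wins. The function names parse_sysctl and
review_settings are hypothetical, and SAMPLE is only a subset of the
posted settings.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines; return (effective values, conflicting repeats).

    Assumes settings are applied in file order, so a later assignment
    overrides an earlier one.
    """
    effective = {}
    repeats = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key in effective and effective[key] != value:
            # Record every value the key has taken, in order of appearance.
            repeats.setdefault(key, [effective[key]]).append(value)
        effective[key] = value
    return effective, repeats


def review_settings(effective):
    """Flag the settings that the replies in this thread call out."""
    notes = []
    recycle = effective.get("net.ipv4.tcp_tw_recycle") == "1"
    reuse = effective.get("net.ipv4.tcp_tw_reuse") == "1"
    if recycle:
        notes.append("net.ipv4.tcp_tw_recycle=1: thread advice is to set it to 0")
    if recycle and reuse:
        notes.append("tcp_tw_reuse=1 together with tcp_tw_recycle=1: "
                     "described in the thread as a bad combination")
    return notes


# A subset of the settings posted in the thread.
SAMPLE = """\
net.core.rmem_max = 8388608
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_max_syn_backlog=4096
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_max_syn_backlog = 10240
net.core.rmem_max = 134217728
"""

if __name__ == "__main__":
    values, repeated = parse_sysctl(SAMPLE)
    for key, seen in repeated.items():
        print(f"{key} set more than once: {seen}, effective {values[key]}")
    for note in review_settings(values):
        print(note)
```

The duplicate report matters here because the posted file assigns
net.core.rmem_max, net.core.wmem_max, net.core.netdev_max_backlog and
net.ipv4.tcp_max_syn_backlog twice with different values.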
* Re: Something hitting my total number of connections to the server
From: Akshat Kakkar @ 2017-08-22 5:44 UTC
To: Eric Dumazet; +Cc: David Laight, netdev

[Apologies for top post.]

There are multiple hosts/clients. All are mainly Windows based.

Timestamps are not used, as my clients are mainly Windows based and
there TCP timestamps are disabled by default.

sysctl is as follows:

kernel.shmmax = 68719476736
kernel.shmall = 4294967296
kernel.pid_max=4194303
vm.max_map_count=131072
kernel.sem=250 32000 32 250

net.netfilter.nf_conntrack_generic_timeout = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 30
net.netfilter.nf_conntrack_tcp_timeout_established = 7200
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 30
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 180
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_fin_timeout=30
net.ipv4.tcp_keepalive_time=1800
net.ipv4.tcp_keepalive_intvl=60
net.ipv4.tcp_keepalive_probes=20
net.ipv4.tcp_max_syn_backlog=4096
net.ipv4.tcp_syncookies=1
net.ipv4.tcp_sack=1
net.ipv4.tcp_dsack=1
net.ipv4.tcp_window_scaling=1
net.ipv4.tcp_syn_retries=3
net.ipv4.tcp_synack_retries=3
net.ipv4.tcp_retries1=3
net.ipv4.tcp_retries2=15
net.ipv4.ip_local_port_range=1024 65535

net.ipv4.tcp_timestamps=0

net.core.netdev_max_backlog=10000
net.core.somaxconn=100000
net.core.optmem_max=81920

net.netfilter.nf_conntrack_max=524288
net.nf_conntrack_max=524288
net.ipv6.conf.all.disable_ipv6 = 1
fs.file-max=1000000

net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_congestion_control=htcp

net.ipv4.tcp_rfc1337 = 1
net.core.netdev_max_backlog = 65536
net.ipv4.tcp_max_tw_buckets = 1440000

net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
* Re: Something hitting my total number of connections to the server
From: Neal Cardwell @ 2017-08-22 12:28 UTC
To: Akshat Kakkar; +Cc: Eric Dumazet, David Laight, netdev

On Tue, Aug 22, 2017 at 1:42 AM, Akshat Kakkar <akshat.1984@gmail.com> wrote:
> There are multiple hosts/clients. All are mainly windows based.
>
> Timestamp is not used as my clients mainly are windows based and in
> that it tcp timestamp is by defauly disabled.
...
> net.ipv4.tcp_tw_reuse=1
> net.ipv4.tcp_tw_recycle=1

I suspect the problem is there. The net.ipv4.tcp_tw_recycle setting
should be 0. Running with the value 1 is known to cause buggy behavior
related to TCP timestamps, and that feature has been removed in kernel
v4.12.

Can you please re-run your tests with net.ipv4.tcp_tw_recycle=0 or a
newer kernel?

neal
* Re: Something hitting my total number of connections to the server 2017-08-22 12:28 ` Neal Cardwell @ 2017-08-23 5:08 ` Akshat Kakkar 2017-08-23 13:35 ` Neal Cardwell 0 siblings, 1 reply; 26+ messages in thread From: Akshat Kakkar @ 2017-08-23 5:08 UTC (permalink / raw) To: Neal Cardwell; +Cc: Eric Dumazet, David Laight, netdev On Tue, Aug 22, 2017 at 5:58 PM, Neal Cardwell <ncardwell@google.com> wrote: > On Tue, Aug 22, 2017 at 1:42 AM, Akshat Kakkar <akshat.1984@gmail.com> wrote: >> There are multiple hosts/clients. All are mainly windows based. >> >> Timestamp is not used as my clients mainly are windows based and in >> that it tcp timestamp is by defauly disabled. > ... >> net.ipv4.tcp_tw_reuse=1 >> net.ipv4.tcp_tw_recycle=1 > > I suspect the problem is there. The net.ipv4.tcp_tw_recycle setting > should be 0. Running with the value 1 is known to cause buggy behavior > related to TCP timestamps, and that feature has been removed in kernel > v4.12. > > Can you please re-run your tests with net.ipv4.tcp_tw_recycle=0 or a > newer kernel? > > neal Thanks for your reply. I understand that. But my point is, though tcp timestamp is enabled on the server, but as client is not using it ... so how come this _bug_ (if any) is triggered in first place. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: Something hitting my total number of connections to the server
From: Neal Cardwell @ 2017-08-23 13:35 UTC
To: Akshat Kakkar; +Cc: Eric Dumazet, David Laight, netdev

On Wed, Aug 23, 2017 at 1:08 AM, Akshat Kakkar <akshat.1984@gmail.com> wrote:
>
> On Tue, Aug 22, 2017 at 5:58 PM, Neal Cardwell <ncardwell@google.com> wrote:
> > On Tue, Aug 22, 2017 at 1:42 AM, Akshat Kakkar <akshat.1984@gmail.com> wrote:
> >> There are multiple hosts/clients. All are mainly windows based.
> >>
> >> Timestamp is not used as my clients mainly are windows based and in
> >> that it tcp timestamp is by default disabled.
> > ...
> >> net.ipv4.tcp_tw_reuse=1
> >> net.ipv4.tcp_tw_recycle=1
> >
> > I suspect the problem is there. The net.ipv4.tcp_tw_recycle setting
> > should be 0. Running with the value 1 is known to cause buggy behavior
> > related to TCP timestamps, and that feature has been removed in kernel
> > v4.12.
> >
> > Can you please re-run your tests with net.ipv4.tcp_tw_recycle=0 or a
> > newer kernel?
> >
> > neal
>
> Thanks for your reply.
>
> I understand that.
>
> But my point is, though tcp timestamp is enabled on the server, but as
> client is not using it ... so how come this _bug_ (if any) is
> triggered in first place.

You mention "clients mainly are windows based". If they are only
"mainly" Windows-based, and some run other OSes that do use TCP
timestamps, and the remote address is the same for
TCP-timestamp-using and non-TCP-timestamp-using clients, then running
with timestamps enabled on the server could tickle the bugs in
pre-4.12 kernels that save info from TCP-timestamp-using connections
and erroneously try to use that info to validate
non-TCP-timestamp-using connections.

But the main point is that the configuration you cited
(net.ipv4.tcp_tw_recycle=1) is an unsupported configuration with known
bugs.

The best resolution would be to just run with
net.ipv4.tcp_tw_recycle=0. It's not worth digging any further unless
you run with net.ipv4.tcp_tw_recycle=0 or a kernel that is v4.12 or
later and still have problems.

Hope that helps,
neal

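As a rough aid to the check Neal asks for, the Python sketch below reads the live value of net.ipv4.tcp_tw_recycle from /proc/sys. The helper names (`read_sysctl`, `tw_recycle_status`) and the `root` parameter are hypothetical, introduced only for illustration. The sketch assumes a Linux host that exposes sysctls under /proc/sys, and it reports a missing or unreadable file as "absent", which is what one would expect on v4.12 or later, where the knob no longer exists.

```python
from pathlib import Path
from typing import Optional


def read_sysctl(name: str, root: str = "/proc/sys") -> Optional[str]:
    """Return the current value of a sysctl, or None if it cannot be read.

    `root` is a hypothetical parameter that exists only so the function
    can be pointed at a scratch directory instead of the real /proc/sys.
    """
    # net.ipv4.tcp_tw_recycle -> <root>/net/ipv4/tcp_tw_recycle
    path = Path(root) / name.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        # Missing (knob not present in this kernel) or unreadable.
        return None


def tw_recycle_status(root: str = "/proc/sys") -> str:
    """Classify net.ipv4.tcp_tw_recycle as 'absent', 'enabled' or 'disabled'."""
    value = read_sysctl("net.ipv4.tcp_tw_recycle", root)
    if value is None:
        return "absent"
    return "disabled" if value == "0" else "enabled"


if __name__ == "__main__":
    print("net.ipv4.tcp_tw_recycle:", tw_recycle_status())
```

If this reports "enabled", running `sysctl -w net.ipv4.tcp_tw_recycle=0` as root changes the value on the running system; any matching line in sysctl.conf would also need to change for the setting to survive a reboot.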
* Re: Something hitting my total number of connections to the server
From: Eric Dumazet @ 2017-08-22 13:02 UTC
To: Akshat Kakkar; +Cc: David Laight, netdev

On Tue, 2017-08-22 at 11:12 +0530, Akshat Kakkar wrote:
> There are multiple hosts/clients. All are mainly windows based.
>
> Timestamp is not used as my clients mainly are windows based and in
> that it tcp timestamp is by default disabled.
>
> sysctl is as follows:
>
> kernel.shmmax = 68719476736
> kernel.shmall = 4294967296
> kernel.pid_max=4194303
> vm.max_map_count=131072
> kernel.sem=250 32000 32 250
>
> net.netfilter.nf_conntrack_generic_timeout = 300
> net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 60
> net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 30
> net.netfilter.nf_conntrack_tcp_timeout_established = 7200
> net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 60
> net.netfilter.nf_conntrack_tcp_timeout_close_wait = 30
> net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
> net.netfilter.nf_conntrack_tcp_timeout_time_wait = 60
> net.netfilter.nf_conntrack_tcp_timeout_close = 10
> net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
> net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
> net.netfilter.nf_conntrack_udp_timeout = 30
> net.netfilter.nf_conntrack_udp_timeout_stream = 180
> net.netfilter.nf_conntrack_icmp_timeout = 30
> net.netfilter.nf_conntrack_events_retry_timeout = 15
> net.core.rmem_max = 8388608
> net.core.wmem_max = 8388608
>
> net.ipv4.tcp_tw_reuse=1
> net.ipv4.tcp_tw_recycle=1

This is exactly what I feared.

We do not support tcp_tw_reuse = 1 AND tcp_tw_recycle = 1

This is a very well known bad combination.

> net.ipv4.tcp_fin_timeout=30
> net.ipv4.tcp_keepalive_time=1800
> net.ipv4.tcp_keepalive_intvl=60
> net.ipv4.tcp_keepalive_probes=20
> net.ipv4.tcp_max_syn_backlog=4096
> net.ipv4.tcp_syncookies=1
> net.ipv4.tcp_sack=1
> net.ipv4.tcp_dsack=1
> net.ipv4.tcp_window_scaling=1
> net.ipv4.tcp_syn_retries=3
> net.ipv4.tcp_synack_retries=3
> net.ipv4.tcp_retries1=3
> net.ipv4.tcp_retries2=15
> net.ipv4.ip_local_port_range=1024 65535
>
> net.ipv4.tcp_timestamps=0
>
> net.core.netdev_max_backlog=10000

This is an insane backlog.

> net.core.somaxconn=100000
> net.core.optmem_max=81920
>
> net.netfilter.nf_conntrack_max=524288
> net.nf_conntrack_max=524288
> net.ipv6.conf.all.disable_ipv6 = 1
> fs.file-max=1000000
>
> net.ipv4.tcp_no_metrics_save = 1
> net.ipv4.tcp_max_syn_backlog = 10240
> net.ipv4.tcp_congestion_control=htcp
>
> net.ipv4.tcp_rfc1337 = 1
> net.core.netdev_max_backlog = 65536

This is a crazy backlog. Do not do that.

> net.ipv4.tcp_max_tw_buckets = 1440000
>
> net.core.rmem_max = 134217728
> net.core.wmem_max = 134217728
>
>
>

It looks like your sysctls have been set to unreasonable values.

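The listing quoted in the message above assigns several keys more than once (net.core.netdev_max_backlog, net.ipv4.tcp_max_syn_backlog, net.core.rmem_max and net.core.wmem_max each appear twice with different values). If that listing is the contents of a sysctl.conf that is applied top to bottom, the later assignment is the one left in effect, which makes such a file easy to misread. The Python sketch below is one possible way to surface those duplicates. The function names and the `SAMPLE` excerpt are hypothetical and chosen for illustration; the parser handles only simple `key = value` lines and is not a full reimplementation of the sysctl.conf format.

```python
from collections import defaultdict
from typing import Dict, List


def parse_sysctl(text: str) -> Dict[str, List[str]]:
    """Map each key to every value assigned to it, in file order."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not key=value.
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        seen[key.strip()].append(value.strip())
    return dict(seen)


def duplicates(text: str) -> Dict[str, List[str]]:
    """Return only the keys that are assigned more than once."""
    return {k: v for k, v in parse_sysctl(text).items() if len(v) > 1}


# Excerpt of the settings quoted in the thread, in their original order.
SAMPLE = """\
net.core.rmem_max = 8388608
net.ipv4.tcp_max_syn_backlog=4096
net.core.netdev_max_backlog=10000
net.core.somaxconn=100000
net.ipv4.tcp_max_syn_backlog = 10240
net.core.netdev_max_backlog = 65536
net.core.rmem_max = 134217728
"""

if __name__ == "__main__":
    for key, values in duplicates(SAMPLE).items():
        # Assumption: the file is applied in order, so the last value wins.
        print(f"{key}: assigned {len(values)} times, last value {values[-1]}")
```

Collecting every value per key, rather than overwriting, keeps the earlier assignments visible so it is clear which setting was silently superseded.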
* Re: Something hitting my total number of connections to the server
From: David Ahern @ 2017-08-22 16:43 UTC
To: Eric Dumazet, Akshat Kakkar; +Cc: David Laight, netdev

On 8/22/17 6:02 AM, Eric Dumazet wrote:
>>
>> net.core.netdev_max_backlog=10000
> This is an insane backlog.
>

https://www.kernel.org/doc/Documentation/networking/scaling.txt

"== Suggested Configuration

Flow limit is useful on systems with many concurrent connections,
where a single connection taking up 50% of a CPU indicates a problem.
In such environments, enable the feature on all CPUs that handle
network rx interrupts (as set in /proc/irq/N/smp_affinity).

The feature depends on the input packet queue length to exceed
the flow limit threshold (50%) + the flow history length (256).
Setting net.core.netdev_max_backlog to either 1000 or 10000
performed well in experiments."

* Re: Something hitting my total number of connections to the server
From: Eric Dumazet @ 2017-08-22 17:44 UTC
To: David Ahern; +Cc: Akshat Kakkar, David Laight, netdev, Willem de Bruijn

On Tue, 2017-08-22 at 09:43 -0700, David Ahern wrote:
> On 8/22/17 6:02 AM, Eric Dumazet wrote:
> >>
> >> net.core.netdev_max_backlog=10000
> > This is an insane backlog.
> >
>
> https://www.kernel.org/doc/Documentation/networking/scaling.txt
>
> "== Suggested Configuration
>
> Flow limit is useful on systems with many concurrent connections,
> where a single connection taking up 50% of a CPU indicates a problem.
> In such environments, enable the feature on all CPUs that handle
> network rx interrupts (as set in /proc/irq/N/smp_affinity).
>
> The feature depends on the input packet queue length to exceed
> the flow limit threshold (50%) + the flow history length (256).
> Setting net.core.netdev_max_backlog to either 1000 or 10000
> performed well in experiments."

10000 is adding tail latencies.

At Google we run all the fleet with a backlog of 1000.

And yes, it took time to get rid of the backlog of 10000 that was set
up years ago, because of old constraints and some fears.

Willem wrote this doc in 2013, before we finally went back to 1000.

We should update this doc.

* Re: Something hitting my total number of connections to the server
From: David Ahern @ 2017-08-22 17:46 UTC
To: Eric Dumazet; +Cc: Akshat Kakkar, David Laight, netdev, Willem de Bruijn

On 8/22/17 10:44 AM, Eric Dumazet wrote:
> Willem wrote this doc in 2013, before we finally went back to 1000.
>
> We should update this doc.

And these too:

$ egrep -r netdev_max_backlog Documentation/networking/
Documentation/networking//cxgb.txt: sysctl -w net.core.netdev_max_backlog=300000
Documentation/networking//ixgb.txt:net.core.netdev_max_backlog = 300000

* Re: Something hitting my total number of connections to the server
From: Eric Dumazet @ 2017-08-22 17:50 UTC
To: David Ahern; +Cc: Akshat Kakkar, David Laight, netdev, Willem de Bruijn

On Tue, 2017-08-22 at 10:46 -0700, David Ahern wrote:
> On 8/22/17 10:44 AM, Eric Dumazet wrote:
> > Willem wrote this doc in 2013, before we finally went back to 1000.
> >
> > We should update this doc.
>
>
> And these too:
>
> $ egrep -r netdev_max_backlog Documentation/networking/
> Documentation/networking//cxgb.txt: sysctl -w
> net.core.netdev_max_backlog=300000
> Documentation/networking//ixgb.txt:net.core.netdev_max_backlog = 300000

Yes, whoever wrote this had no idea of the implications I guess.
