* Preventing SSH timeouts . Some clarification needed
@ 2010-06-08  9:36 query
  2010-06-08  9:48 ` Michal Nazarewicz
  2010-06-08 10:39 ` Glynn Clements
  0 siblings, 2 replies; 16+ messages in thread
From: query @ 2010-06-08  9:36 UTC (permalink / raw)
  To: linux-admin

Hi,

We are seeing some dropped SSH connections, because of which some of
our processes are failing. The most likely reason for the drops is
that both the client and the server remain 100% busy during a certain
time interval, and it is during that interval that we see the
occasional connection closed by the server.

===
ssh_exchange_identification: Connection closed by remote host
Return_status (65280)
Exit_value (255)
End_time (03 Jun 2010 22:41:41)
====

One workaround I can see is to add a timeout value using the
ClientAliveInterval option in /etc/ssh/sshd_config on the server
side. Assume I have set the timeout value to 300.

According to the sshd man page, this option "sets a timeout interval
in seconds after which if no data has been received from the client,
sshd(8) will send a message through the encrypted channel to request
a response from the client."

Let's take a situation where the SSH client is 100% busy or idle and
has not communicated with the server for around 300 seconds. In this
case, with the above option set, the server should send a message to
the client after 300 seconds. The following two scenarios come to
mind:

1) If the server is also 100% busy during that time and cannot send
   the message to the client, will the SSH connection be dropped?
2) Or suppose the server becomes somewhat free after 350 seconds.
   Will it drop the connection, or will it send a message to the
   client to check whether the client is still active, since it
   could not send the message at 300 seconds because it was busy at
   the time?

Please clarify the SSH behaviour for the above scenarios. I hope the
question is clear and the scenarios make sense.

Thanks
Zaman

^ permalink raw reply	[flat|nested] 16+ messages in thread
* Re: Preventing SSH timeouts . Some clarification needed
From: Michal Nazarewicz @ 2010-06-08  9:48 UTC
  To: query; +Cc: linux-admin

query <query.cdac@gmail.com> writes:
> We are seeing some dropped SSH connections, because of which some of
> our processes are failing. The most likely reason for the drops is
> that both the client and the server remain 100% busy during a certain
> time interval, and it is during that interval that we see the
> occasional connection closed by the server.

Are you sure it's not because of some NATing which may have a shorter
timeout than the one used by SSH's keep-alive?

> Let's take a situation where the SSH client is 100% busy or idle and
> has not communicated with the server for around 300 seconds. In this
> case, with the above option set, the server should send a message to
> the client after 300 seconds. [...]

300 seconds is a very long time. I consider it unlikely that a process
that was idle for 300 seconds wouldn't quickly get a few CPU cycles
just to send a simple packet. I find it possible only if you use
non-standard scheduling policies.

Hope it helps, even though I haven't answered your original question.

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
* Re: Preventing SSH timeouts . Some clarification needed
From: query @ 2010-06-08 15:10 UTC
  To: Michal Nazarewicz; +Cc: linux-admin

2010/6/8 Michal Nazarewicz <mina86@tlen.pl>:
>> 2010/6/8 Michal Nazarewicz <mina86@tlen.pl>:
>>> Are you sure it's not because of some NATing which may have a shorter
>>> timeout than the one used by SSH's keep-alive?
>
> query <query.cdac@gmail.com> writes:
>> I am not 100% sure, but during the connection dropout time the CPU
>> is 100% busy, as shown by our own reporting utility. Regarding NAT,
>> I don't think those hosts are behind NAT, as there was no
>> requirement for those hosts to access the Internet. Anyway, I will
>> confirm this with the network admin.
>>
>> P.S.: Is there any utility that can tell us whether we are behind
>> NAT or not?
>
> If “ifconfig” on one host gives you a different IP address than the
> other host sees as the incoming IP, then you are behind NAT.

Sorry, I failed to understand the above statement. But I have
something in mind that I will try tomorrow.
From the source machine, I send a packet to a remote host on a
different network. If I capture the packet on the remote host and
its source IP address turns out to differ from that of the source
host, then I am probably behind NAT. These hosts have private IP
addresses. Will this help me find out whether I am behind NAT?

> There may be some other services that close the connection, like
> firewalls and such. You should ask your network administrator
> whether there are any on the path and whether those could drop
> connections.
>
> Also Glynn's suggestion of making the keep-alive timeout shorter may
> work.
>
> I find it hard to believe that high CPU usage could cause connection
> dropping unless you have some *really* busy machine, but then you
> should consider upgrading the hardware or rethinking what services
> those servers provide.

These hosts are not providing any Internet service and are mainly
responsible for processing gigabits of data. The processing
continues for around 24 hours. During the processing, it uses
around 100% CPU for around 4 hours, and the connection drops happened
during that time. I am not sure what processing goes on during the
time that takes all the CPU. The user we are talking about here is
the user under whom these processes run. I hope this helps to
understand the scenario.

Thanks
Zaman

--
To unsubscribe from this list: send the line "unsubscribe linux-admin" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Re: Preventing SSH timeouts . Some clarification needed
From: Michal Nazarewicz @ 2010-06-08 19:48 UTC
  To: query; +Cc: linux-admin

> 2010/6/8 Michal Nazarewicz <mina86@tlen.pl>:
>> If “ifconfig” on one host gives you a different IP address than the
>> other host sees as the incoming IP, then you are behind NAT.

query <query.cdac@gmail.com> writes:
> Sorry, I failed to understand the above statement. But I have
> something in mind that I will try tomorrow.
> From the source machine, I send a packet to a remote host on a
> different network. If I capture the packet on the remote host and
> its source IP address turns out to differ from that of the source
> host, then I am probably behind NAT. These hosts have private IP
> addresses.

Yes, that's what I've written. ;)

You run “ifconfig” on one machine and it will show you what its IP
address is (there may be several interfaces). Then you connect from
this machine to the other machine and check the source address on the
other machine. Repeat for the other direction, even though, if you can
actually connect from one machine to the other, it is likely that the
one you are connecting to is not behind NAT.

> These hosts are not providing any Internet service and are mainly
> responsible for processing gigabits of data. The processing
> continues for around 24 hours. During the processing, it uses
> around 100% CPU for around 4 hours, and the connection drops happened
> during that time. I am not sure what processing goes on during the
> time that takes all the CPU. The user we are talking about here is
> the user under whom these processes run.

If the processing takes so much time and is so CPU-consuming, I'd try
running it with nice, i.e. “nice -n 20 process-data” rather than the
plain “process-data” command.

If the dropping in fact happens only during high CPU usage, then maybe
it is in fact the culprit. By running the processing via nice it gets
lower priority (so in effect everything else gets higher priority in
comparison). This could help SSH get the CPU time it needs in time.

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
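The NAT check described in this message can be sketched in a few lines
of Python. This is only an illustration, not a tool used in the thread:
the function name looks_natted and the sample addresses are
hypothetical. In practice the list would come from ifconfig on the
source host and the single address from a packet capture on the remote
host.

```python
import ipaddress


def looks_natted(local_addrs, observed_source):
    """Return True if the peer saw a source address that is not one of ours."""
    local = {ipaddress.ip_address(addr) for addr in local_addrs}
    return ipaddress.ip_address(observed_source) not in local


# Hypothetical sample data: interface addresses on the source host, and
# the source address seen in a capture on the remote host.
own_addresses = ["10.0.0.5", "127.0.0.1"]
print(looks_natted(own_addresses, "10.0.0.5"))     # False: address unchanged
print(looks_natted(own_addresses, "203.0.113.7"))  # True: rewritten on the way
```

As the message notes, the check should be repeated in the other
direction before concluding anything.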
* Re: Preventing SSH timeouts . Some clarification needed
From: query @ 2010-06-09  5:33 UTC
  To: Michal Nazarewicz; +Cc: linux-admin

Hi Guys,

Finally got the clarification. We are not behind NAT. I checked
myself by injecting some packets and sniffing them on the destination
host, as Michal suggested. I tried the experiment both ways and no
change in IP address can be seen. I cross-verified with our network
admin that those hosts are not behind NAT.

So, the most likely cause of these connection drop-outs is

2010/6/9 Michal Nazarewicz <mina86@tlen.pl>:
>> 2010/6/8 Michal Nazarewicz <mina86@tlen.pl>:
>>> If “ifconfig” on one host gives you a different IP address than the
>>> other host sees as the incoming IP, then you are behind NAT.
>
> query <query.cdac@gmail.com> writes:
>> Sorry, I failed to understand the above statement. But I have
>> something in mind that I will try tomorrow.
>> From the source machine, I send a packet to a remote host on a
>> different network. If I capture the packet on the remote host and
>> its source IP address turns out to differ from that of the source
>> host, then I am probably behind NAT. These hosts have private IP
>> addresses.
>
> Yes, that's what I've written. ;)
>
> You run “ifconfig” on one machine and it will show you what its IP
> address is (there may be several interfaces). Then you connect from
> this machine to the other machine and check the source address on the
> other machine. Repeat for the other direction, even though, if you can
> actually connect from one machine to the other, it is likely that the
> one you are connecting to is not behind NAT.
>
>> These hosts are not providing any Internet service and are mainly
>> responsible for processing gigabits of data. The processing
>> continues for around 24 hours. During the processing, it uses
>> around 100% CPU for around 4 hours, and the connection drops happened
>> during that time. I am not sure what processing goes on during the
>> time that takes all the CPU. The user we are talking about here is
>> the user under whom these processes run.
>
> If the processing takes so much time and is so CPU-consuming, I'd try
> running it with nice, i.e. “nice -n 20 process-data” rather than the
> plain “process-data” command.
>
> If the dropping in fact happens only during high CPU usage, then maybe
> it is in fact the culprit. By running the processing via nice it gets
> lower priority (so in effect everything else gets higher priority in
> comparison). This could help SSH get the CPU time it needs in time.

Thanks for this suggestion. But we will probably not be able to do
that. Our application itself is doing ssh to the other host, and
during high load the SSH connection drops and our application fails.

Thanks
Zaman
* Re: Preventing SSH timeouts . Some clarification needed
From: Glynn Clements @ 2010-06-08 10:39 UTC
  To: query; +Cc: linux-admin

query wrote:

> One workaround I can see is to add a timeout value using the
> ClientAliveInterval option in /etc/ssh/sshd_config on the server
> side. Assume I have set the timeout value to 300.
>
> According to the sshd man page, this option "sets a timeout interval
> in seconds after which if no data has been received from the client,
> sshd(8) will send a message through the encrypted channel to request
> a response from the client."
>
> Let's take a situation where the SSH client is 100% busy or idle and
> has not communicated with the server for around 300 seconds. In this
> case, with the above option set, the server should send a message to
> the client after 300 seconds. The following two scenarios come to
> mind:
>
> 1) If the server is also 100% busy during that time and cannot send
>    the message to the client, will the SSH connection be dropped?
> 2) Or suppose the server becomes somewhat free after 350 seconds.
>    Will it drop the connection, or will it send a message to the
>    client to check whether the client is still active, since it
>    could not send the message at 300 seconds because it was busy at
>    the time?

According to the sshd_config(5) manpage, the server will close the
connection after ClientAliveCountMax messages (default value: 3) have
been sent.

I can't see how this can be caused by load. If you haven't yet enabled
ClientAliveInterval, then the connection isn't being closed by sshd
but by the kernel, due to TCP keep-alives not being acknowledged.

By default, the kernel doesn't start sending keep-alives until the
connection has been idle for 2 hours, after which it sends 9 probes at
an interval of 75 seconds, so the system would need to be
non-responsive for over 11 minutes. And the responses are generated by
the kernel, so they'll be sent even if the process is suspended.

As Michal suggests, the most likely reason for this is a NAT timeout.
If you're using NAT, you probably want to set the keep-alive time
(/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT
timeout. Even then, that will only work for programs which enable
keep-alive (ssh and sshd both do by default; this is controlled by the
TCPKeepAlive option).

-- 
Glynn Clements <glynn@gclements.plus.com>
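The timing in this message can be worked through in a few lines of
Python. This is only an illustrative sketch: the function names are
made up here, the kernel figures are the defaults quoted above (2 hours
idle, 9 probes, 75 seconds apart), and the sshd figure assumes the
300-second ClientAliveInterval from the original question together with
the default ClientAliveCountMax of 3.

```python
def tcp_keepalive_window(idle, probes, interval):
    """Return (seconds spent probing, seconds from last traffic to drop)."""
    probing = probes * interval
    return probing, idle + probing


def sshd_client_alive_window(interval, count_max=3):
    """Approximate seconds of client silence before sshd disconnects."""
    return interval * count_max


probing, total = tcp_keepalive_window(idle=7200, probes=9, interval=75)
print(probing, probing / 60)          # 675 11.25 (the "over 11 minutes")
print(total)                          # 7875 seconds since the last traffic
print(sshd_client_alive_window(300))  # 900
```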
* Re: Preventing SSH timeouts . Some clarification needed
From: query @ 2010-06-08 15:10 UTC
  To: Glynn Clements; +Cc: linux-admin

Thanks for the suggestion.

On Tue, Jun 8, 2010 at 4:09 PM, Glynn Clements <glynn@gclements.plus.com> wrote:
>
> query wrote:
>
>> One workaround I can see is to add a timeout value using the
>> ClientAliveInterval option in /etc/ssh/sshd_config on the server
>> side. Assume I have set the timeout value to 300.
>>
>> According to the sshd man page, this option "sets a timeout interval
>> in seconds after which if no data has been received from the client,
>> sshd(8) will send a message through the encrypted channel to request
>> a response from the client."
>>
>> Let's take a situation where the SSH client is 100% busy or idle and
>> has not communicated with the server for around 300 seconds. In this
>> case, with the above option set, the server should send a message to
>> the client after 300 seconds. The following two scenarios come to
>> mind:
>>
>> 1) If the server is also 100% busy during that time and cannot send
>>    the message to the client, will the SSH connection be dropped?
>> 2) Or suppose the server becomes somewhat free after 350 seconds.
>>    Will it drop the connection, or will it send a message to the
>>    client to check whether the client is still active, since it
>>    could not send the message at 300 seconds because it was busy at
>>    the time?
>
> According to the sshd_config(5) manpage, the server will close the
> connection after ClientAliveCountMax messages (default value: 3) have
> been sent.
>
> I can't see how this can be caused by load. If you haven't yet enabled
> ClientAliveInterval, then the connection isn't being closed by sshd
> but by the kernel, due to TCP keep-alives not being acknowledged.

Okay, that may be the cause. The client host was also busy, because
of which the TCP keep-alives were not acknowledged.

> By default, the kernel doesn't start sending keep-alives until the
> connection has been idle for 2 hours, after which it sends 9 probes at
> an interval of 75 seconds, so the system would need to be
> non-responsive for over 11 minutes. And the responses are generated by
> the kernel, so they'll be sent even if the process is suspended.
>
> As Michal suggests, the most likely reason for this is a NAT timeout.
> If you're using NAT, you probably want to set the keep-alive time
> (/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT
> timeout. Even then, that will only work for programs which enable
> keep-alive (ssh and sshd both do by default; this is controlled by the
> TCPKeepAlive option).

How do I determine the value of the NAT timeout? Is it set at the
host level or on the device where NAT is implemented?

I was able to find the keepalive timeout values on the host.

====
$ sudo sysctl -a | grep -i keepalive
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
=====

Most likely I am not behind NAT; I will confirm it tomorrow. If that
is the case, then which timeout value should I consider changing:
the kernel timeout value, or should I use either the TCPKeepAlive
option or the ClientAliveInterval option? The TCPKeepAlive option is
somehow disabled in the sshd config file. Please clarify regarding
this.

Once again, thanks all

--Zaman
* Re: Preventing SSH timeouts . Some clarification needed
From: Glynn Clements @ 2010-06-08 16:19 UTC
  To: query; +Cc: linux-admin

query wrote:

> > I can't see how this can be caused by load. If you haven't yet enabled
> > ClientAliveInterval, then the connection isn't being closed by sshd
> > but by the kernel, due to TCP keep-alives not being acknowledged.
>
> Okay, that may be the cause. The client host was also busy, because
> of which the TCP keep-alives were not acknowledged.

Load won't have any effect upon TCP keep-alives, as it's the kernel
which acknowledges keep-alive packets, not the user process.

Keep-alive allows you to detect that a host is unreachable (e.g.
network failure, system crash, power failure, etc). It doesn't tell
you anything about an individual process.

> > As Michal suggests, the most likely reason for this is a NAT timeout.
> > If you're using NAT, you probably want to set the keep-alive time
> > (/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT
> > timeout. Even then, that will only work for programs which enable
> > keep-alive (ssh and sshd both do by default; this is controlled by the
> > TCPKeepAlive option).
>
> How do I determine the value of the NAT timeout? Is it set at the
> host level or on the device where NAT is implemented?

The device which performs NAT.

> I was able to find the keepalive timeout values on the host.
>
> ====
> $ sudo sysctl -a | grep -i keepalive
> net.ipv4.tcp_keepalive_time = 7200
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_intvl = 75
> =====
>
> Most likely I am not behind NAT; I will confirm it tomorrow. If that
> is the case, then which timeout value should I consider changing:
> the kernel timeout value, or should I use either the TCPKeepAlive
> option or the ClientAliveInterval option? The TCPKeepAlive option is
> somehow disabled in the sshd config file. Please clarify regarding
> this.

TCPKeepAlive is enabled by default. But even if it's enabled, the
2-hour wait before any keep-alives are sent typically won't be enough
to prevent NAT entries from expiring.

Even the 5-minute interval between SSH keep-alives may be longer than
the NAT expiry time. Low-end router/modem devices with built-in NAT
seem to base their default configuration on the assumption that you're
using HTTP from Win95 boxes, where a connection being idle for more
than 30 seconds usually means that the Win95 box has crashed.

Another possibility is a really cheap ISP which uses (a heavily
oversubscribed pool of) dynamic IP addresses, which expire whenever
the connection is idle for more than a minute.

-- 
Glynn Clements <glynn@gclements.plus.com>
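The comparison made in this message (keep-alive interval versus the
idle timeout of whatever device tracks the connection) can be sketched
as follows. This is a deliberately simplified model under stated
assumptions: the function name and the tracker timeouts of 30 and 600
seconds are hypothetical, the 60-second setting is not a default, and
lost packets are ignored.

```python
def keeps_entry_alive(keepalive_interval, tracker_idle_timeout):
    """True if a keep-alive goes out before the tracker expires the entry."""
    return keepalive_interval < tracker_idle_timeout


# (label, seconds between keep-alives on an otherwise idle connection)
candidates = [
    ("kernel tcp_keepalive_time", 7200),
    ("SSH keep-alive every 5 minutes", 300),
    ("SSH keep-alive every minute", 60),  # hypothetical shorter setting
]

# Hypothetical idle timeouts for a NAT or firewall device, in seconds.
for timeout in (30, 600):
    for label, interval in candidates:
        verdict = "ok" if keeps_entry_alive(interval, timeout) else "expires"
        print(f"timeout={timeout:>3}s  {label:<32} {verdict}")
```

The real timeout has to be read from the device that performs the
tracking, as the reply above says.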
* Re: Preventing SSH timeouts . Some clarification needed
From: query @ 2010-06-09  6:44 UTC
  To: linux-admin; +Cc: Glynn Clements, Michal Nazarewicz

Guys, since it is now clear that we are not behind NAT, we can
forget about reducing the keep-alive time
(/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT
timeout. But anyway, I learned something new :).
The most likely reason, which Michal also agreed with, may be the
high load on both systems.

So, do you now suggest enabling the ClientAliveInterval option?
Also, since ClientAliveCountMax is enabled by default with a value
of 3, I will probably keep the value of ClientAliveInterval below
300 seconds. I will probably keep it at 60 seconds, so the
connection will drop out after 180 seconds if there is no response.

Also, somewhat strangely, the TCPKeepAlive option is disabled in our
sshd_config file; I am not sure why. So, if ClientAliveInterval is
enabled, can we leave TCPKeepAlive disabled? Will that serve our
purpose?

Thanks
Zaman
* Re: Preventing SSH timeouts . Some clarification needed
From: Adam T. Bowen @ 2010-06-09  8:15 UTC
  To: query; +Cc: linux-admin

Hi,

Another setting to try, on the client side, is ServerAliveInterval.
This causes a keep-alive message to be sent within the SSH protocol,
as opposed to TCPKeepAlive, which works within the underlying TCP
connection. I have had the misfortune to be behind firewalls that
have harvested "dead" connections far too quickly, in my opinion, and
this setting worked for me where TCPKeepAlive didn't. Worth a try.

Cheers,

Adam
* Re: Preventing SSH timeouts . Some clarification needed 2010-06-09 6:44 ` query 2010-06-09 8:15 ` Adam T. Bowen @ 2010-06-09 10:14 ` Glynn Clements [not found] ` <AANLkTimDS_IalexVnOKtOuKN8fz13rFumHV8TrjEGtph@mail.gmail.com> 1 sibling, 1 reply; 16+ messages in thread From: Glynn Clements @ 2010-06-09 10:14 UTC (permalink / raw) To: query; +Cc: linux-admin, Michal Nazarewicz query wrote: > Guys , since we are clear now that we are not behind NAT , so we can > forget now about reducing the keep-alive time Note that the same issue applies for firewalls which use connection tracking to determine which packets to allow. Ultimately, it's the connection tracking that's the issue, not NAT per se. If the router "forgets" a connection because it hasn't seen any traffic in a long time, and the result of this is that subsequent packets are silently discarded, then the connection will cease to work, resulting in a timeout occurring the next time either side tries to send data. This isn't an issue if connection tracking is only used for scheduling. > So, do you suggest now to enable to enable the ClientAliveInterval > option . Also , since ClientAliveCountMax is enabled by default with a > value of 3 , > so probably I will keep the value of ClientAliveInterval less than 300 > secs . I will probably keep it at 60 secs. So , the connection will > dropout after 180 secs if there is no response . > > Also , somewhat strange , TCPKeepAlive option is disabled in our > sshd_config file , not sure why . So , If ClientAliveInterval is > enabled , can we can leave TCPKeepAlive disabled . Is our purpose > will serve ? The main reason to disable TCP keep-alives is that they can cause a connection to be dropped unnecessarily. A secondary reason is that they will cause an on-demand link-layer connection to be opened unnecessarily, but that's less of issue nowadays. Without keep-alives, an idle TCP connection doesn't cause any packets to be sent. 
The physical link could be down for days, but the TCP connection will
remain open so long as no packets are sent while the link is down.
Enabling keep-alives will cause the connection to fail in this
situation.

The main purpose of keep-alives is to prevent the situation where the
client system crashes, leaving the server process listening for packets
which will never arrive. Without keep-alives, there's no way to
distinguish between a client which has permanently vanished and one
which has merely been idle for a long time.

The SSH keep-alives (ClientAliveInterval and ServerAliveInterval) serve
a similar purpose (to force a connection to be terminated when the
other end disappears without explicitly closing the connection), except
that the SSH protocol prevents spoofing. This prevents the situation
where an attacker "silences" one side of the connection and spoofs TCP
keep-alives to prevent the server from realising that something has
happened.

--
Glynn Clements <glynn@gclements.plus.com>

^ permalink raw reply [flat|nested] 16+ messages in thread
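For readers who want to see what a kernel-level TCP keep-alive looks like from a program's point of view, here is a minimal sketch. It is not sshd's code; it only shows the socket option that a setting like TCPKeepAlive corresponds to. The function name and the numeric values are assumptions made for illustration, and the per-socket tuning options are platform-specific (they exist on Linux), so the sketch skips any that are missing.

```python
import socket


def enable_tcp_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on kernel-level TCP keep-alives for one socket.

    The numeric defaults here are illustrative assumptions, not
    recommendations taken from the thread.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket tuning constants are platform-specific; where one is
    # missing, the system-wide default for that parameter applies.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_tcp_keepalive(s)
    # Non-zero means the kernel will probe this connection when it is idle.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

Once the option is set, the probes are sent and answered by the kernel, so they keep flowing even when the application itself is starved of CPU. SSH-level keep-alives differ on exactly this point: they need the ssh/sshd process to run.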
[parent not found: <AANLkTimDS_IalexVnOKtOuKN8fz13rFumHV8TrjEGtph@mail.gmail.com>]
[parent not found: <19471.47290.566464.539451@cerise.gclements.plus.com>]
* Re: Preventing SSH timeouts . Some clarification needed
  [not found] ` <19471.47290.566464.539451@cerise.gclements.plus.com>
@ 2010-06-10 6:02 ` query
  2010-06-10 13:03 ` Glynn Clements
  0 siblings, 1 reply; 16+ messages in thread
From: query @ 2010-06-10 6:02 UTC (permalink / raw)
  To: Glynn Clements; +Cc: linux-admin

On Wed, Jun 9, 2010 at 9:22 PM, Glynn Clements <glynn@gclements.plus.com> wrote:
>
> query wrote:
>
>> okay..So ,what I can understand is that keep-alives or similar like
>> (ClientAliveInterval and ServerAliveInterval) options are never
>> going to help to prevent those timeouts . Enabling those options ,
>> will only adverse the situation .
>
> Not necessarily. If the problem is caused by connection tracking
> expiring the connection, keep-alives may prevent this from happening,
> although the default settings for TCP keep-alives are probably
> insufficient.
>
>> So , if the client host is busy for a long time and is not able to
>> send any messages to the SSH server , then the server will drop the
>> connection assuming that the client has crashed for whatever reason
>> if keep-alives like options are enabled .
>
> Yes, for SSH keep-alives. TCP keep-alives are handled by the kernel,
> and only require that the host is functioning and connected. Even if
> the ssh or sshd processes were completely suspended (in the sense of
> "kill -STOP ..."), TCP keep-alives will continue to be sent and/or
> acknowledged.
>
>> On the other hand , if
>> keep-alive option is disabled , the server is never going to drop the
>> SSH connection even if the client crashes or 100% busy ( could not
>> send a message to the server) or idle . The SSH connection drop was
>> initiated by the kernel as you mentioned in your first comment and we
>> can do nothing on the SSH configurations to avoid those timeouts.
>
> If the problem is due to connection tracking, enabling frequent
> keep-alives should prevent the connection from expiring.
> However, this can cause a connection to be dropped if the system is
> under heavy load, even if the connection is otherwise idle. The risk
> can be reduced by increasing the value for the ClientAliveCountMax or
> ServerAliveCountMax options, so that the connection is only dropped if
> the process stops responding for an extended period.

Okay, thanks for the clarification. Since the host sometimes remains
busy for around 2 hours, ClientAliveCountMax should be a high value in
our case.

==========
                     cpu     mem
Time               %util   %util
06/07-23:00  -  -  100.0    17.4
06/07-23:30  -  -  100.0    18.1
06/08-00:00  -  -  100.0    18.0
06/08-00:30  -  -  100.0    17.4
=========

Since I am not sure of the connection tracking timeout value, I am
planning to set ClientAliveInterval and ServerAliveInterval to 300 secs
and the CountMax value to 24, so that even in the worst case of high
load it continues to send messages to the server and the connection
does not break.

Since in our case both the client and the server remain busy at the
same time, I am planning to use the option on both the client and the
server, so that either of them can send an SSH keep-alive message to
inform the router that the connection is alive. But I hope it will not
add any extra load on the server, since the CPU is already at 100%.

Thanks
Zaman

>
> --
> Glynn Clements <glynn@gclements.plus.com>
>

^ permalink raw reply [flat|nested] 16+ messages in thread
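The arithmetic behind the settings discussed in this message can be sketched as follows. The function name is hypothetical. The model is the one the sshd_config man page describes: an unresponsive peer is dropped after roughly interval times count-max seconds.

```python
def seconds_until_drop(alive_interval_s, alive_count_max):
    """Approximate time an unresponsive peer is tolerated before sshd
    (or ssh) gives up: one keep-alive per interval, count_max misses."""
    if alive_interval_s <= 0:
        raise ValueError("an interval of 0 disables SSH keep-alives")
    return alive_interval_s * alive_count_max


if __name__ == "__main__":
    # (60, 3) is the earlier proposal in the thread; (300, 24) is this one.
    for interval, count in ((60, 3), (300, 24)):
        total = seconds_until_drop(interval, count)
        print(f"{interval}s x {count} -> {total}s ({total / 60:.0f} min)")
```

With 300 and 24 the peer is tolerated for 7200 seconds, which matches the 2-hour busy window. Note that the two numbers do different jobs: the product decides how long an unresponsive peer is tolerated, while the interval alone decides how often traffic crosses the router. A 300-second interval only helps if the router's idle timeout is longer than that.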
* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10 6:02 ` query
@ 2010-06-10 13:03 ` Glynn Clements
  2010-06-10 16:35 ` query
  0 siblings, 1 reply; 16+ messages in thread
From: Glynn Clements @ 2010-06-10 13:03 UTC (permalink / raw)
  To: query; +Cc: linux-admin

query wrote:

> okay..Thanks for the clarification . Since the host sometimes
> continues to remain busy for around 2 hours ,

Busy to the point that ssh/sshd doesn't get *any* CPU time for 2 hours?
Either you're misunderstanding something, or that's a seriously
misconfigured server.

In general, processes which need a lot of CPU cycles should have a
lower priority than those which need little. The relative "importance"
of processes doesn't matter here. A system where the key process gets
95% CPU while support processes get the other 5% is preferable to one
where the key process gets 100% CPU and support processes are suspended
for long periods.

--
Glynn Clements <glynn@gclements.plus.com>

^ permalink raw reply [flat|nested] 16+ messages in thread
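One way to act on the advice above is to start the CPU-heavy job at a lower scheduling priority, so that sshd and other support processes still get CPU time. The sketch below is an illustration only: the function name and the increment of 10 are assumptions, and it relies on POSIX-only features (os.nice and preexec_fn).

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10):
    """Run cmd as a child process whose nice value is raised by
    `increment`, i.e. at a lower scheduling priority than the parent."""
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the CPU-heavy job: a child that reports its own niceness.
    result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("child niceness:", result.stdout.strip())
```

The same effect is available from the shell by prefixing the job with nice(1). The job can still use all otherwise idle CPU, but it gives way when a higher-priority process needs to run.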
* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10 13:03 ` Glynn Clements
@ 2010-06-10 16:35 ` query
  2010-06-10 23:52 ` Glynn Clements
  0 siblings, 1 reply; 16+ messages in thread
From: query @ 2010-06-10 16:35 UTC (permalink / raw)
  To: Glynn Clements; +Cc: linux-admin

On Thu, Jun 10, 2010 at 6:33 PM, Glynn Clements <glynn@gclements.plus.com> wrote:
>
> query wrote:
>
>> okay..Thanks for the clarification . Since the host sometimes
>> continues to remain busy for around 2 hours ,
>
> Busy to the point that ssh/sshd doesn't get *any* CPU time for 2
> hours? Either you're misunderstanding something, or that's a seriously
> misconfigured server.

That was my misunderstanding. The CPU is 100% busy, but it is not the
case that all of it is used by our process while no other process gets
CPU time. I will work out an optimal value by looking at the system
once more during peak CPU utilization.

But I am still confused about who is terminating the connection in our
case and how the timeout value is calculated. As you mentioned in your
first comment, it is the kernel that terminates the connection, but on
what basis does it do so? As you said earlier, keep-alive allows us to
detect that a host is unreachable (e.g. network failure, system crash,
power failure, etc.); it is not going to kill sshd.

Apologies for repeating the same question, but I am still confused
about this.

Thanks
Zaman

>
> In general, processes which need a lot of CPU cycles should have a
> lower priority than those which need little. The relative "importance"
> of processes doesn't matter here. A system where the key process gets
> 95% CPU while support processes get the other 5% is preferable to one
> where the key process gets 100% CPU and support processes are
> suspended for long periods.
>
> --
> Glynn Clements <glynn@gclements.plus.com>
>

^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10 16:35 ` query
@ 2010-06-10 23:52 ` Glynn Clements
  2010-06-11 7:22 ` query
  0 siblings, 1 reply; 16+ messages in thread
From: Glynn Clements @ 2010-06-10 23:52 UTC (permalink / raw)
  To: query; +Cc: linux-admin

query wrote:

> >> okay..Thanks for the clarification . Since the host sometimes
> >> continues to remain busy for around 2 hours ,
> >
> > Busy to the point that ssh/sshd doesn't get *any* CPU time for 2
> > hours? Either you're misunderstanding something, or that's a seriously
> > misconfigured server.
>
> That is my misunderstanding only .The CPU is 100% busy but it is not
> that all the 100% is being utilized by our process and no other
> process is getting the CPU time. I will calculate an optimal value by
> going through once more over the system during the peak CPU
> utilization .
> But I am still confused who is terminating the connection in our case
> and on how is calculating the timeout value.
> AS you mentioned in your first comment that it the kernel who is
> terminating the connection , but based on what it is terminating
> the connection . As you said earlier , Keep-alive allows us to detect
> that a host is unreachable (e.g.
> network failure, system crash, power failure, etc) , It is not going
> to kill sshd ,

It won't kill sshd; however, if packets (data or keep-alives) which are
sent to the client stop being acknowledged, operations on the socket
will eventually fail with ETIMEDOUT. At this point, sshd will close the
connection of its own accord.

The relevant setting is /proc/sys/net/ipv4/tcp_retries2:

       tcp_retries2 (integer; default: 15; since Linux 2.2)
              The maximum number of times a TCP packet is retransmitted in
              established state before giving up. The default value is 15,
              which corresponds to a duration of approximately between 13 to
              30 minutes, depending on the retransmission timeout. The
              RFC 1122 specified minimum limit of 100 seconds is typically
              deemed too short.

The initial retransmission timeout is determined by the measured
round-trip latency for the connection. Subsequent retransmissions occur
at exponentially increasing intervals, capped at 120 seconds.

If keep-alives aren't being sent, the connection can only time out as a
result of data being sent. If keep-alives are being sent, a timeout can
occur even if the connection is otherwise idle (that's the purpose of
keep-alives).

--
Glynn Clements <glynn@gclements.plus.com>

^ permalink raw reply [flat|nested] 16+ messages in thread
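The "13 to 30 minutes" range follows from the backoff just described, and a simplified model makes that visible. The sketch below is not the kernel's exact algorithm. It assumes the timer starts at a given initial value, doubles after each unacknowledged transmission, and is capped at 120 seconds. The function name and the sample initial timeouts are assumptions.

```python
def retransmit_timeout(initial_rto_s, retries=15, rto_cap_s=120.0):
    """Total seconds before TCP gives up, under a simplified backoff model.

    The original transmission plus `retries` retransmissions give
    retries + 1 waiting periods; each wait doubles, capped at rto_cap_s.
    """
    total = 0.0
    rto = initial_rto_s
    for _ in range(retries + 1):
        total += min(rto, rto_cap_s)
        rto *= 2
    return total


if __name__ == "__main__":
    # Sample initial timeouts (assumed): a fast LAN and a slow link.
    for rto in (0.2, 3.0):
        t = retransmit_timeout(rto)
        print(f"initial RTO {rto}s -> about {t / 60:.1f} minutes")
```

In this model, an initial timeout of 0.2 seconds gives about 15 minutes and an initial timeout of 3 seconds gives about 23 minutes, both inside the range quoted from the man page. Lowering tcp_retries2 shortens the total, at the cost of giving up sooner on links that are merely slow.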
* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10 23:52 ` Glynn Clements
@ 2010-06-11 7:22 ` query
  0 siblings, 0 replies; 16+ messages in thread
From: query @ 2010-06-11 7:22 UTC (permalink / raw)
  To: Glynn Clements; +Cc: linux-admin

Clements, thanks for all the detailed explanation. I think things are
clear to me now. I will try to apply the changes in sshd_config.

And thanks to Michal and everyone else for providing insights on the
issue.

Thanks
Zaman

On Fri, Jun 11, 2010 at 5:22 AM, Glynn Clements <glynn@gclements.plus.com> wrote:
>
> query wrote:
>
>> >> okay..Thanks for the clarification . Since the host sometimes
>> >> continues to remain busy for around 2 hours ,
>> >
>> > Busy to the point that ssh/sshd doesn't get *any* CPU time for 2
>> > hours? Either you're misunderstanding something, or that's a seriously
>> > misconfigured server.
>>
>> That is my misunderstanding only .The CPU is 100% busy but it is not
>> that all the 100% is being utilized by our process and no other
>> process is getting the CPU time. I will calculate an optimal value by
>> going through once more over the system during the peak CPU
>> utilization .
>> But I am still confused who is terminating the connection in our case
>> and on how is calculating the timeout value.
>> AS you mentioned in your first comment that it the kernel who is
>> terminating the connection , but based on what it is terminating
>> the connection . As you said earlier , Keep-alive allows us to detect
>> that a host is unreachable (e.g.
>> network failure, system crash, power failure, etc) , It is not going
>> to kill sshd ,
>
> It won't kill sshd; however, if packets (data or keep-alives) which
> are sent to the client stop being acknowledged, operations on the
> socket will eventually fail with ETIMEDOUT. At this point, sshd will
> close the connection of its own accord.
>
> The relevant setting is /proc/sys/net/ipv4/tcp_retries2:
>
>       tcp_retries2 (integer; default: 15; since Linux 2.2)
>              The maximum number of times a TCP packet is retransmitted in
>              established state before giving up. The default value is 15,
>              which corresponds to a duration of approximately between 13 to
>              30 minutes, depending on the retransmission timeout. The
>              RFC 1122 specified minimum limit of 100 seconds is typically
>              deemed too short.
>
> The initial retransmission timeout is determined by the measured
> round-trip latency for the connection. Subsequent retransmissions
> occur at exponentially increasing intervals, capped at 120 seconds.
>
> If keep-alives aren't being sent, the connection can only time out as
> a result of data being sent. If keep-alives are being sent, a timeout
> can occur even if the connection is otherwise idle (that's the purpose
> of keep-alives).
>
> --
> Glynn Clements <glynn@gclements.plus.com>
>

^ permalink raw reply [flat|nested] 16+ messages in thread
Thread overview: 16+ messages
2010-06-08 9:36 Preventing SSH timeouts . Some clarification needed query
2010-06-08 9:48 ` Michal Nazarewicz
[not found] ` <AANLkTimTjSOmbj_ac4iiUMaHRuvp1-ljW-FUGAQbb1qt@mail.gmail.com>
[not found] ` <87k4q9g31y.fsf@erwin.mina86.com>
2010-06-08 15:10 ` query
2010-06-08 19:48 ` Michal Nazarewicz
2010-06-09 5:33 ` query
2010-06-08 10:39 ` Glynn Clements
2010-06-08 15:10 ` query
2010-06-08 16:19 ` Glynn Clements
2010-06-09 6:44 ` query
2010-06-09 8:15 ` Adam T. Bowen
2010-06-09 10:14 ` Glynn Clements
[not found] ` <AANLkTimDS_IalexVnOKtOuKN8fz13rFumHV8TrjEGtph@mail.gmail.com>
[not found] ` <19471.47290.566464.539451@cerise.gclements.plus.com>
2010-06-10 6:02 ` query
2010-06-10 13:03 ` Glynn Clements
2010-06-10 16:35 ` query
2010-06-10 23:52 ` Glynn Clements
2010-06-11 7:22 ` query