Preventing SSH timeouts . Some clarification needed

* Preventing SSH timeouts . Some clarification needed
@ 2010-06-08  9:36 query
  2010-06-08  9:48 ` Michal Nazarewicz
  2010-06-08 10:39 ` Glynn Clements
  0 siblings, 2 replies; 16+ messages in thread
From: query @ 2010-06-08  9:36 UTC (permalink / raw)
  To: linux-admin

Hi,

We are seeing some dropped SSH connections, because of which some of
our processes are failing. The most likely reason for the connection
drops is that both the client and the server remain 100% busy during a
certain time interval, and during that interval we see occasional
connections closed by the server.

====
 ssh_exchange_identification: Connection closed by remote host
  Return_status (65280)
  Exit_value (255)
  End_time (03 Jun 2010 22:41:41)
====

One workaround I can see is to set a timeout with the
ClientAliveInterval option in /etc/ssh/sshd_config on the server side.
Assume I have set the value to 300.

According to the sshd man page, this option sets "a timeout interval in
seconds after which if no data has been received from the client,
sshd(8) will send a message through the encrypted channel to request a
response from the client."

Take a situation where the SSH client is either 100% busy or idle and
has not communicated with the server for around 300 seconds. With the
above option set, the server should then send a message to the client
after 300 seconds. Two scenarios come to mind:

1) If the server is also 100% busy at that time and cannot send the
message to the client, will the SSH connection be dropped?
2) Or, suppose the server only becomes somewhat free after 350 seconds.
Will it then drop the connection, or will it send the message to check
whether the client is still active, since it could not send it at
300 seconds?

Please clarify the SSH behaviour in these scenarios. I hope the
question is clear and the scenarios make sense.

Thanks
Zaman

* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-08  9:36 Preventing SSH timeouts . Some clarification needed query
@ 2010-06-08  9:48 ` Michal Nazarewicz
       [not found]   ` <AANLkTimTjSOmbj_ac4iiUMaHRuvp1-ljW-FUGAQbb1qt@mail.gmail.com>
  2010-06-08 10:39 ` Glynn Clements
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Nazarewicz @ 2010-06-08  9:48 UTC (permalink / raw)
  To: query; +Cc: linux-admin

query <query.cdac@gmail.com> writes:
> We are seeing some dropped SSH connections, because of which some of
> our processes are failing. The most likely reason for the connection
> drops is that both the client and the server remain 100% busy during
> a certain time interval, and during that interval we see occasional
> connections closed by the server.

Are you sure it's not because of some NAT which may have a shorter
timeout than the one used by SSH's keep-alive?

> Take a situation where the SSH client is either 100% busy or idle and
> has not communicated with the server for around 300 seconds. With the
> above option set, the server should then send a message to the client
> after 300 seconds.  [...]

300 seconds is a very long time.  I consider it unlikely that a process
that was idle for 300 seconds wouldn't quickly get a few CPU cycles just
to send a simple packet.  I find it possible only if you use
non-standard scheduling policies.

Hope this helps, even though I haven't answered your original question.

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--


[parent not found: <AANLkTimTjSOmbj_ac4iiUMaHRuvp1-ljW-FUGAQbb1qt@mail.gmail.com>]

[parent not found: <87k4q9g31y.fsf@erwin.mina86.com>]

* Re: Preventing SSH timeouts . Some clarification needed
       [not found]     ` <87k4q9g31y.fsf@erwin.mina86.com>
@ 2010-06-08 15:10       ` query
  2010-06-08 19:48         ` Michal Nazarewicz
  0 siblings, 1 reply; 16+ messages in thread
From: query @ 2010-06-08 15:10 UTC (permalink / raw)
  To: Michal Nazarewicz; +Cc: linux-admin

2010/6/8 Michal Nazarewicz <mina86@tlen.pl>:
>> 2010/6/8 Michal Nazarewicz <mina86@tlen.pl>:
>>> Are you sure it's not because of some NATing which may have a shorter
>>> timeout then the one used by SSH's keep alive?
>
> query <query.cdac@gmail.com> writes:
>> I am not 100% sure, but during the connection drop-outs the CPU is
>> 100% busy, as shown by our own reporting utility. Regarding NAT, I
>> don't think those hosts are behind NAT, as there was no requirement
>> for them to access the Internet. Anyway, I will confirm this with
>> the network admin.
>>
>> P.S.: Is there any utility that can tell us whether we are behind
>> NAT or not?
>
> If “ifconfig” on one host gives you a different IP address than the
> one the other host sees as the incoming IP, then you are behind NAT.
>

Sorry, I failed to understand the above statement. But I have
something in mind that I will try tomorrow: from the source machine, I
send a packet to a remote host on a different network. If I capture
the packet on the remote host and its source IP address differs from
that of the source host, then I am probably behind NAT. These hosts
have private IP addresses.

Will that help me find out whether I am behind NAT?

> There may be other things that close the connection, such as
> firewalls. You should check with your network administrator whether
> there are any on the path and whether they could drop connections.
>
> Also, Glynn's suggestion of making the keep-alive timeout shorter may
> work.
>
> I find it hard to believe that high CPU usage could cause connections
> to drop unless you have a *really* busy machine, but then you should
> consider upgrading the hardware or rethinking what services those
> servers provide.

These hosts do not provide any Internet service; they are mainly
responsible for processing gigabits of data. The processing runs for
around 24 hours, and for around 4 of those hours it uses close to 100%
of the CPU. The connection drops happened during that time. I am not
sure what processing goes on during that period that takes all the
CPU. The user we are talking about here is the user under whom these
processes run.

Hope this helps to explain the scenario.
>
> --
> Best regards,                                         _     _
>  .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
>  ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
>  ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--
>
Thanks
Zaman
--
To unsubscribe from this list: send the line "unsubscribe linux-admin" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-08 15:10       ` query
@ 2010-06-08 19:48         ` Michal Nazarewicz
  2010-06-09  5:33           ` query
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Nazarewicz @ 2010-06-08 19:48 UTC (permalink / raw)
  To: query; +Cc: linux-admin

> 2010/6/8 Michal Nazarewicz <mina86@tlen.pl>:
>> If “ifconfig” on one host gives you a different IP address than the
>> one the other host sees as the incoming IP, then you are behind NAT.

query <query.cdac@gmail.com> writes:
> Sorry, I failed to understand the above statement. But I have
> something in mind that I will try tomorrow: from the source machine, I
> send a packet to a remote host on a different network. If I capture
> the packet on the remote host and its source IP address differs from
> that of the source host, then I am probably behind NAT. These hosts
> have private IP addresses.

Yes, that's what I wrote. ;)

You run “ifconfig” on one machine and it will show you what its IP
address is (there may be several interfaces).  Then you connect from
this machine to the other machine and check the source address on the
other machine.  Repeat in the other direction, although if you can
actually connect from one machine to the other, it is likely that the
one you are connecting to is not behind NAT.
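
The comparison described above can be sketched in a few lines of
Python. This is only an illustration: the function name looks_natted
and the sample addresses are made up, and the inputs still have to be
collected by hand (the local list from “ifconfig”, the observed address
from a packet capture or a log on the other machine).

```python
import ipaddress


def looks_natted(local_addrs, seen_by_peer):
    """Return True if the source address observed on the far host is not
    one of the addresses configured on the near host.

    local_addrs  -- addresses reported by ifconfig on the connecting host
    seen_by_peer -- source address the other host saw for the connection
    """
    seen = ipaddress.ip_address(seen_by_peer)
    local = {ipaddress.ip_address(a) for a in local_addrs}
    # If the peer saw an address we do not own, something on the path
    # rewrote it, which is what NAT does.
    return seen not in local


# Hypothetical example values.
print(looks_natted(["10.0.0.5", "127.0.0.1"], "10.0.0.5"))     # False
print(looks_natted(["10.0.0.5", "127.0.0.1"], "203.0.113.7"))  # True
```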

> These hosts do not provide any Internet service; they are mainly
> responsible for processing gigabits of data. The processing runs for
> around 24 hours, and for around 4 of those hours it uses close to 100%
> of the CPU. The connection drops happened during that time. I am not
> sure what processing goes on during that period that takes all the
> CPU. The user we are talking about here is the user under whom these
> processes run.

If the processing takes so much time and is so CPU-intensive, I'd try
running it with nice, i.e. “nice -n 20 process-data” rather than a
plain “process-data” command.

If the drops in fact happen only during high CPU usage, then maybe
that is in fact the culprit.  Running the processing via nice gives it
a lower priority (so in effect everything else gets a higher priority
in comparison).  This could help SSH get the CPU time it needs in time.
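
If the processing is launched from a script, the same idea might look
like the sketch below. “process-data” is the placeholder command name
used above and with_nice is a made-up helper. The sketch uses 19
because that is the lowest priority available on Linux, so a larger
number has no further effect.

```python
import subprocess


def with_nice(cmd, niceness=19):
    """Prefix a command line with nice(1) so it runs at lower priority."""
    return ["nice", "-n", str(niceness), *cmd]


argv = with_nice(["process-data"])
print(argv)  # ['nice', '-n', '19', 'process-data']

# To actually start the job (process-data must exist on the PATH):
# subprocess.run(argv, check=True)
```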

-- 
Best regards,                                         _     _
 .o. | Liege of Serenly Enlightened Majesty of      o' \,=./ `o
 ..o | Computer Science,  Michal "mina86" Nazarewicz   (o o)
 ooo +--<mina86-tlen.pl>--<jid:mina86-jabber.org>--ooO--(_)--Ooo--


* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-08 19:48         ` Michal Nazarewicz
@ 2010-06-09  5:33           ` query
  0 siblings, 0 replies; 16+ messages in thread
From: query @ 2010-06-09  5:33 UTC (permalink / raw)
  To: Michal Nazarewicz; +Cc: linux-admin

Hi guys,

I finally got clarification: we are not behind NAT. I checked myself
by injecting some packets and sniffing them on the destination host,
as Michal suggested. I tried the experiment both ways and saw no
change in the IP addresses. I also cross-checked with our network
admin that those hosts are not behind NAT.
So, the most likely cause of these connection drop-outs is



2010/6/9 Michal Nazarewicz <mina86@tlen.pl>:
> If the processing takes so much time and is so CPU-intensive, I'd try
> running it with nice, i.e. “nice -n 20 process-data” rather than a
> plain “process-data” command.
>
> If the drops in fact happen only during high CPU usage, then maybe
> that is in fact the culprit.  Running the processing via nice gives it
> a lower priority (so in effect everything else gets a higher priority
> in comparison).  This could help SSH get the CPU time it needs in time.

Thanks for the suggestion, but we will probably not be able to do
that. Our application itself makes the SSH connection to the other
host; during high load the SSH connection drops and our application
fails.


Thanks
Zaman

* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-08  9:36 Preventing SSH timeouts . Some clarification needed query
  2010-06-08  9:48 ` Michal Nazarewicz
@ 2010-06-08 10:39 ` Glynn Clements
  2010-06-08 15:10   ` query
  1 sibling, 1 reply; 16+ messages in thread
From: Glynn Clements @ 2010-06-08 10:39 UTC (permalink / raw)
  To: query; +Cc: linux-admin

query wrote:

> One workaround I can see is to set a timeout with the
> ClientAliveInterval option in /etc/ssh/sshd_config on the server side.
> Assume I have set the value to 300.
>
> According to the sshd man page, this option sets "a timeout interval in
> seconds after which if no data has been received from the client,
> sshd(8) will send a message through the encrypted channel to request a
> response from the client."
>
> Take a situation where the SSH client is either 100% busy or idle and
> has not communicated with the server for around 300 seconds. With the
> above option set, the server should then send a message to the client
> after 300 seconds. Two scenarios come to mind:
>
> 1) If the server is also 100% busy at that time and cannot send the
> message to the client, will the SSH connection be dropped?
> 2) Or, suppose the server only becomes somewhat free after 350 seconds.
> Will it then drop the connection, or will it send the message to check
> whether the client is still active, since it could not send it at
> 300 seconds?

According to the sshd_config(5) manpage, the server will close the
connection after ClientAliveCountMax messages (default value: 3) have
been sent.

I can't see how this can be caused by load. If you haven't yet enabled
ClientAliveInterval, then the connection isn't being closed by sshd
but by the kernel, due to TCP keep-alives not being acknowledged.

By default, the kernel doesn't start sending keep-alives until the
connection has been idle for 2 hours, after which it sends 9 probes at
an interval of 75 seconds, so the system would need to be
non-responsive for over 11 minutes. And the responses are generated by
the kernel, so they'll be sent even if the process is suspended.
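
As a quick check of the arithmetic above, assuming the default values
just quoted (7200 seconds idle, 9 probes, 75 seconds apart); the
function name is only for illustration:

```python
def keepalive_timings(idle=7200, probes=9, interval=75):
    """Return (seconds spent probing, total seconds from the last
    traffic until the kernel gives up on the connection)."""
    probing = probes * interval
    return probing, idle + probing


probing, total = keepalive_timings()
print(probing / 60)   # 11.25 (minutes of unanswered probes)
print(total / 3600)   # 2.1875 (hours since the connection went idle)
```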

As Michal suggests, the most likely reason for this is a NAT timeout. 
If you're using NAT, you probably want to set the keep-alive time
(/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT
timeout. Even then, that will only work for programs which enable
keep-alive (ssh and sshd both do by default; this is controlled by the
TCPKeepAlive option).
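
To compare a host's settings with a known NAT timeout, something like
the following could be used. It is a sketch, not a tool: read_keepalive
and survives_nat are invented names, the 600-second NAT timeout is an
assumed figure (the real one has to come from whoever runs the NAT
device), and the directory is a parameter so the code can be pointed
somewhere other than /proc.

```python
from pathlib import Path

KEEPALIVE_NAMES = (
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_keepalive(base="/proc/sys/net/ipv4"):
    """Read the kernel's TCP keep-alive settings from `base` as integers."""
    return {n: int((Path(base) / n).read_text().strip()) for n in KEEPALIVE_NAMES}


def survives_nat(settings, nat_timeout):
    """True if the first keep-alive is sent before the NAT entry expires."""
    return settings["tcp_keepalive_time"] < nat_timeout


# With the kernel default of 7200 seconds and an assumed 600-second NAT
# timeout, the first keep-alive would arrive too late:
print(survives_nat({"tcp_keepalive_time": 7200}, nat_timeout=600))  # False
```

On a Linux host, survives_nat(read_keepalive(), 600) would run the same
check against the live values.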

-- 
Glynn Clements <glynn@gclements.plus.com>

* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-08 10:39 ` Glynn Clements
@ 2010-06-08 15:10   ` query
  2010-06-08 16:19     ` Glynn Clements
  0 siblings, 1 reply; 16+ messages in thread
From: query @ 2010-06-08 15:10 UTC (permalink / raw)
  To: Glynn Clements; +Cc: linux-admin

Thanks for the suggestion.

On Tue, Jun 8, 2010 at 4:09 PM, Glynn Clements <glynn@gclements.plus.com> wrote:
>
> I can't see how this can be caused by load. If you haven't yet enabled
> ClientAliveInterval, then the connection isn't being closed by sshd
> but by the kernel, due to TCP keep-alives not being acknowledged.

Okay, that may be the cause. The client host was also busy, so the
TCP keep-alives were not acknowledged.


>
> By default, the kernel doesn't start sending keep-alives until the
> connection has been idle for 2 hours, after which it sends 9 probes at
> an interval of 75 seconds, so the system would need to be
> non-responsive for over 11 minutes. And the responses are generated by
> the kernel, so they'll be sent even if the process is suspended.
>
> As Michal suggests, the most likely reason for this is a NAT timeout.
> If you're using NAT, you probably want to set the keep-alive time
> (/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT
> timeout. Even then, that will only work for programs which enable
> keep-alive (ssh and sshd both do by default; this is controlled by the
> TCPKeepAlive option).

How do I determine the NAT timeout? Is it set at the host level or on
the device where NAT is implemented? I was able to find the
keep-alive timeout values on the host:

====
$ sudo sysctl -a | grep -i keepalive
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
=====

Most likely I am not behind NAT; I will confirm it tomorrow. If that
is the case, which setting should I adjust to increase the timeout:
the kernel timeout value, the TCPKeepAlive option, or
ClientAliveInterval? The TCPKeepAlive option is for some reason
disabled in our sshd config file. Please clarify.


>
> --
> Glynn Clements <glynn@gclements.plus.com>
>

Once again, thanks all.

--Zaman

* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-08 15:10   ` query
@ 2010-06-08 16:19     ` Glynn Clements
  2010-06-09  6:44       ` query
  0 siblings, 1 reply; 16+ messages in thread
From: Glynn Clements @ 2010-06-08 16:19 UTC (permalink / raw)
  To: query; +Cc: linux-admin

query wrote:

> > I can't see how this can be caused by load. If you haven't yet enabled
> > ClientAliveInterval, then the connection isn't being closed by sshd
> > but by the kernel, due to TCP keep-alives not being acknowledged.
> 
> Okay, that may be the cause. The client host was also busy, so the
> TCP keep-alives were not acknowledged.

Load won't have any effect upon TCP keep-alives, as it's the kernel
which acknowledges keep-alive packets, not the user process.

Keep-alive allows you to detect that a host is unreachable (e.g. 
network failure, system crash, power failure, etc). It doesn't tell
you anything about an individual process.

> > As Michal suggests, the most likely reason for this is a NAT timeout.
> > If you're using NAT, you probably want to set the keep-alive time
> > (/proc/sys/net/ipv4/tcp_keepalive_time) to a value less than the NAT
> > timeout. Even then, that will only work for programs which enable
> > keep-alive (ssh and sshd both do by default; this is controlled by the
> > TCPKeepAlive option).
> 
> How do I determine the NAT timeout? Is it set at the host level or on
> the device where NAT is implemented?

The device which performs NAT.

> I was able to find the keep-alive timeout values on the host:
> 
> ====
> $ sudo sysctl -a | grep -i keepalive
> net.ipv4.tcp_keepalive_time = 7200
> net.ipv4.tcp_keepalive_probes = 9
> net.ipv4.tcp_keepalive_intvl = 75
> =====
> 
> Most likely I am not behind NAT; I will confirm it tomorrow. If that
> is the case, which setting should I adjust to increase the timeout:
> the kernel timeout value, the TCPKeepAlive option, or
> ClientAliveInterval? The TCPKeepAlive option is for some reason
> disabled in our sshd config file. Please clarify.

TCPKeepAlive is enabled by default. But even if it's enabled, the
2-hour wait before any keep-alives are sent typically won't be enough
to prevent NAT entries from expiring.

Even the 5-minute interval between SSH keep-alives may be longer than
the NAT expiry time. Low-end router/modem devices with built-in NAT
seem to base their default configuration on the assumption that you're
using HTTP from Win95 boxes, where a connection being idle for more
than 30 seconds usually means that the Win95 box has crashed.

Another possibility is a really cheap ISP which uses (a heavily
oversubscribed pool of) dynamic IP addresses, which expire whenever
the connection is idle for more than a minute.

-- 
Glynn Clements <glynn@gclements.plus.com>

* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-08 16:19     ` Glynn Clements
@ 2010-06-09  6:44       ` query
  2010-06-09  8:15         ` Adam T. Bowen
  2010-06-09 10:14         ` Glynn Clements
  0 siblings, 2 replies; 16+ messages in thread
From: query @ 2010-06-09  6:44 UTC (permalink / raw)
  To: linux-admin; +Cc: Glynn Clements, Michal Nazarewicz

Guys, now that it is clear we are not behind NAT, we can forget about
reducing the keep-alive time (/proc/sys/net/ipv4/tcp_keepalive_time)
to a value less than the NAT timeout. But anyway, I learned something
new :) .
The most likely reason, which Michal also agreed is possible, is the
high load on both systems.

So, do you now suggest enabling the ClientAliveInterval option? Since
ClientAliveCountMax defaults to 3, I will probably keep
ClientAliveInterval below 300 seconds, most likely at 60 seconds. The
connection will then drop out after 180 seconds if there is no
response.
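
The arithmetic, as a small sketch. The function name is invented, and
per sshd_config(5) the resulting figure is approximate:

```python
def client_alive_drop_after(interval, count_max=3):
    """Approximate seconds an unresponsive client is tolerated before
    sshd disconnects it: one interval per unanswered message."""
    if interval <= 0:
        return None  # ClientAliveInterval 0 means no messages are sent
    return interval * count_max


print(client_alive_drop_after(60))    # 180
print(client_alive_drop_after(300))   # 900
```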

Also, somewhat strangely, the TCPKeepAlive option is disabled in our
sshd_config file; I am not sure why. If ClientAliveInterval is
enabled, can we leave TCPKeepAlive disabled? Will that serve our
purpose?


Thanks
Zaman


* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-09  6:44       ` query
@ 2010-06-09  8:15         ` Adam T. Bowen
  2010-06-09 10:14         ` Glynn Clements
  1 sibling, 0 replies; 16+ messages in thread
From: Adam T. Bowen @ 2010-06-09  8:15 UTC (permalink / raw)
  To: query; +Cc: linux-admin

Hi,

Another setting to try, on the client side, is the ServerAliveInterval.
This sets a keep alive packet to be sent within the SSH protocol, as
opposed to TCPKeepAlive which is within the underlying TCP connection. I
have had the misfortune to be behind firewalls that have harvested
"dead" connections far too quickly, in my opinion, and this setting
worked for me where TCPKeepAlive didn't. Worth a try.
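
If ssh is started from a program rather than by hand, the option can
be passed on the command line with -o. A sketch, in which ssh_argv, the
host name and the remote command are placeholders, and the 60-second
interval and count of 3 are only example values:

```python
def ssh_argv(host, remote_cmd, interval=60, count_max=3):
    """Build an ssh command line with SSH-level keep-alives enabled."""
    return [
        "ssh",
        "-o", f"ServerAliveInterval={interval}",
        "-o", f"ServerAliveCountMax={count_max}",
        host,
        remote_cmd,
    ]


# "datahost" and "uptime" are placeholders.
print(" ".join(ssh_argv("datahost", "uptime")))
# ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 datahost uptime
```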

Cheers,

Adam


* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-09  6:44       ` query
  2010-06-09  8:15         ` Adam T. Bowen
@ 2010-06-09 10:14         ` Glynn Clements
       [not found]           ` <AANLkTimDS_IalexVnOKtOuKN8fz13rFumHV8TrjEGtph@mail.gmail.com>
  1 sibling, 1 reply; 16+ messages in thread
From: Glynn Clements @ 2010-06-09 10:14 UTC (permalink / raw)
  To: query; +Cc: linux-admin, Michal Nazarewicz

query wrote:

> Guys, since it is now clear that we are not behind NAT, we can
> forget about reducing the keep-alive time

Note that the same issue applies for firewalls which use connection
tracking to determine which packets to allow. Ultimately, it's the
connection tracking that's the issue, not NAT per se.

If the router "forgets" a connection because it hasn't seen any
traffic in a long time, and the result of this is that subsequent
packets are silently discarded, then the connection will cease to
work, resulting in a timeout occurring the next time either side tries
to send data.

This isn't an issue if connection tracking is only used for
scheduling.

> So, do you now suggest enabling the ClientAliveInterval option? Since
> ClientAliveCountMax is enabled by default with a value of 3, I will
> probably set ClientAliveInterval to less than 300 seconds, say 60
> seconds, so the connection will drop after 180 seconds if there is
> no response.
> 
> Also, somewhat strangely, the TCPKeepAlive option is disabled in our
> sshd_config file; I am not sure why. If ClientAliveInterval is
> enabled, can we leave TCPKeepAlive disabled? Will that serve our
> purpose?

The main reason to disable TCP keep-alives is that they can cause a
connection to be dropped unnecessarily. A secondary reason is that
they will cause an on-demand link-layer connection to be opened
unnecessarily, but that's less of an issue nowadays.

Without keep-alives, an idle TCP connection doesn't cause any packets
to be sent. The physical link could be down for days, but the TCP
connection will remain open so long as no packets are sent while the
link is down. Enabling keep-alives will cause the connection to fail
in this situation.

The main purpose of keep-alives is to prevent the situation where the
client system crashes, leaving the server process listening for
packets which will never arrive. Without keep-alives, there's no way to
distinguish between a client which has permanently vanished and one
which has merely been idle for a long time.
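
As a rough illustration of what enabling TCP keep-alives on a
connection amounts to, here is a minimal Python sketch. The function
names and the 60/10/3 values are made up for the example and are not
anything ssh or sshd uses; the per-socket TCP_KEEP* options are
platform-specific, hence the hasattr() guards. The second function is
only the approximate arithmetic for how long a vanished peer can go
unnoticed on an otherwise idle connection.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=3):
    """Turn on TCP keep-alive for one socket (illustrative values)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the system-wide defaults; skipped where the
    # platform does not provide them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def dead_peer_detection_seconds(keepalive_time, probes, intvl):
    """Idle time before the first probe, plus one interval per lost probe."""
    return keepalive_time + probes * intvl


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("SO_KEEPALIVE:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
    # sysctl values quoted earlier in this thread: 7200, 9, 75
    print(dead_peer_detection_seconds(7200, 9, 75), "seconds")
```

With the sysctl values quoted earlier in the thread, a crashed client
on an idle connection would be noticed only after a bit more than two
hours.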

The SSH keep-alives (ClientAliveInterval and ServerAliveInterval)
serve a similar purpose (to force a connection to be terminated when
the other end disappears without explicitly closing the connection),
except that the SSH protocol prevents spoofing. This prevents the
situation where an attacker "silences" one side of the connection and
spoofs TCP keep-alives to prevent the server from realising that
something has happened.

-- 
Glynn Clements <glynn@gclements.plus.com>


[parent not found: <AANLkTimDS_IalexVnOKtOuKN8fz13rFumHV8TrjEGtph@mail.gmail.com>]

[parent not found: <19471.47290.566464.539451@cerise.gclements.plus.com>]

* Re: Preventing SSH timeouts . Some clarification needed
       [not found]             ` <19471.47290.566464.539451@cerise.gclements.plus.com>
@ 2010-06-10  6:02               ` query
  2010-06-10 13:03                 ` Glynn Clements
  0 siblings, 1 reply; 16+ messages in thread
From: query @ 2010-06-10  6:02 UTC (permalink / raw)
  To: Glynn Clements; +Cc: linux-admin

On Wed, Jun 9, 2010 at 9:22 PM, Glynn Clements <glynn@gclements.plus.com> wrote:
>
> query wrote:
>
>> Okay. So what I understand is that keep-alives or similar options
>> (ClientAliveInterval and ServerAliveInterval) are never going to
>> help prevent those timeouts. Enabling those options will only
>> worsen the situation.
>
> Not necessarily. If the problem is caused by connection tracking
> expiring the connection, keep-alives may prevent this from happening,
> although the default settings for TCP keep-alives are probably
> insufficient.
>
>> So, if the client host is busy for a long time and is not able to
>> send any messages to the SSH server, then the server will drop the
>> connection, assuming that the client has crashed for whatever
>> reason, if keep-alive options are enabled.
>
> Yes, for SSH keep-alives. TCP keep-alives are handled by the kernel,
> and only require that the host is functioning and connected. Even if
> the ssh or sshd processes were completely suspended (in the sense of
> "kill -STOP ..."), TCP keep-alives will continue to be sent and/or
> acknowledged.
>
>> On the other hand, if the keep-alive option is disabled, the server
>> is never going to drop the SSH connection even if the client
>> crashes, is 100% busy (and cannot send a message to the server), or
>> is idle. The SSH connection drop was initiated by the kernel, as you
>> mentioned in your first comment, and we can do nothing in the SSH
>> configuration to avoid those timeouts.
>
> If the problem is due to connection tracking, enabling frequent
> keep-alives should prevent the connection from expiring. However, this
> can cause a connection to be dropped if the system is under heavy
> load, even if the connection is otherwise idle. The risk can be
> reduced by increasing the value for the ClientAliveCountMax or
> ServerAliveCountMax options, so that the connection is only dropped if
> the process stops responding for an extended period.

Okay, thanks for the clarification. Since the host sometimes remains
busy for around 2 hours, ClientAliveCountMax should be set to a high
value in our case.

==========
                         cpu     mem
Time                   %util   %util

06/07-23:00    -    -  100.0    17.4
06/07-23:30    -    -  100.0    18.1
06/08-00:00    -    -  100.0    18.0
06/08-00:30    -    -  100.0    17.4
=========


Since I am not sure of the connection tracking timeout value, I am
planning to set ClientAliveInterval and ServerAliveInterval to 300
seconds and the CountMax values to 24, so that even in the worst case
of high load the keep-alive messages continue to be sent and the
connection does not break. Since in our case both the client and the
server are busy at the same time, I am planning to use the options on
both sides, so that either of them can send an SSH keep-alive message
to inform the router that the connection is alive. I hope this will
not add any extra load on the server, since the CPU is already at
100%.
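
To put numbers on the settings discussed so far: with SSH keep-alives,
a peer that stops answering is dropped after roughly interval times
count-max seconds. A small Python sketch of that arithmetic (the
function name is made up for illustration):

```python
def unresponsive_timeout(alive_interval, alive_count_max):
    """Seconds of silence after which ssh/sshd gives up on the peer.

    An interval of 0 means SSH keep-alives are off, so there is no
    SSH-level timeout at all (returns None).
    """
    if alive_interval <= 0:
        return None
    return alive_interval * alive_count_max


if __name__ == "__main__":
    # The two combinations mentioned in this thread.
    for interval, count in [(60, 3), (300, 24)]:
        secs = unresponsive_timeout(interval, count)
        print(f"interval={interval}s count={count}: {secs}s ({secs / 60:.0f} min)")
```

The interval alone decides how often the router sees traffic on the
connection; the count only decides how long a silent peer is
tolerated before the connection is dropped.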

Thanks
Zaman



* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10  6:02               ` query
@ 2010-06-10 13:03                 ` Glynn Clements
  2010-06-10 16:35                   ` query
  0 siblings, 1 reply; 16+ messages in thread
From: Glynn Clements @ 2010-06-10 13:03 UTC (permalink / raw)
  To: query; +Cc: linux-admin

query wrote:

> Okay, thanks for the clarification. Since the host sometimes
> continues to remain busy for around 2 hours,

Busy to the point that ssh/sshd doesn't get *any* CPU time for 2
hours? Either you're misunderstanding something, or that's a seriously
misconfigured server.

In general, processes which need a lot of CPU cycles should have a
lower priority than those which need little. The relative "importance"
of processes doesn't matter here. A system where the key process gets
95% CPU while support processes get the other 5% is preferable to one
where the key process gets 100% CPU and support processes are
suspended for long periods.
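
For example, the CPU-heavy job could be started with a higher nice
value (lower priority), so that ssh/sshd and other support processes
still get scheduled. A minimal, Unix-only Python sketch, with a
made-up helper name and a stand-in command in place of the real job:

```python
import os
import subprocess
import sys


def run_low_priority(cmd, increment=10):
    """Run cmd with its niceness raised by `increment` (Unix only)."""
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in for the real job: the child just reports its own niceness.
    child = [sys.executable, "-c", "import os; print(os.nice(0))"]
    result = run_low_priority(child)
    print("parent niceness:", os.nice(0))
    print("child niceness:", result.stdout.strip())
```

This has the same effect as starting the job through nice(1) from the
shell.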

-- 
Glynn Clements <glynn@gclements.plus.com>


* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10 13:03                 ` Glynn Clements
@ 2010-06-10 16:35                   ` query
  2010-06-10 23:52                     ` Glynn Clements
  0 siblings, 1 reply; 16+ messages in thread
From: query @ 2010-06-10 16:35 UTC (permalink / raw)
  To: Glynn Clements; +Cc: linux-admin

On Thu, Jun 10, 2010 at 6:33 PM, Glynn Clements
<glynn@gclements.plus.com> wrote:
>
> query wrote:
>
>> Okay, thanks for the clarification. Since the host sometimes
>> continues to remain busy for around 2 hours,
>
> Busy to the point that ssh/sshd doesn't get *any* CPU time for 2
> hours? Either you're misunderstanding something, or that's a seriously
> misconfigured server.

That was my misunderstanding. The CPU is 100% busy, but it is not the
case that our process is using all of it and no other process is
getting any CPU time. I will work out an optimal value by looking at
the system once more during peak CPU utilization.

But I am still confused about who is terminating the connection in our
case, and how the timeout value is calculated. As you mentioned in
your first comment, it is the kernel that terminates the connection,
but on what basis does it do so? As you said earlier, keep-alive
allows us to detect that a host is unreachable (e.g. network failure,
system crash, power failure, etc.); it is not going to kill sshd.

Apologies for repeating the same question, but I am still confused
about this.

Thanks
Zaman


* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10 16:35                   ` query
@ 2010-06-10 23:52                     ` Glynn Clements
  2010-06-11  7:22                       ` query
  0 siblings, 1 reply; 16+ messages in thread
From: Glynn Clements @ 2010-06-10 23:52 UTC (permalink / raw)
  To: query; +Cc: linux-admin


query wrote:

> >> Okay, thanks for the clarification. Since the host sometimes
> >> continues to remain busy for around 2 hours,
> >
> > Busy to the point that ssh/sshd doesn't get *any* CPU time for 2
> > hours? Either you're misunderstanding something, or that's a seriously
> > misconfigured server.
> 
> That was my misunderstanding. The CPU is 100% busy, but it is not the
> case that our process is using all of it and no other process is
> getting any CPU time. I will work out an optimal value by looking at
> the system once more during peak CPU utilization.
> 
> But I am still confused about who is terminating the connection in our
> case, and how the timeout value is calculated. As you mentioned in
> your first comment, it is the kernel that terminates the connection,
> but on what basis does it do so? As you said earlier, keep-alive
> allows us to detect that a host is unreachable (e.g. network failure,
> system crash, power failure, etc.); it is not going to kill sshd.

It won't kill sshd; however, if packets (data or keep-alives) which
are sent to the client stop being acknowledged, operations on the
socket will eventually fail with ETIMEDOUT. At this point, sshd will
close the connection of its own accord.

The relevant setting is /proc/sys/net/ipv4/tcp_retries2:

       tcp_retries2 (integer; default: 15; since Linux 2.2)
              The maximum number of times a TCP packet is retransmitted in
              established state before giving up.  The default value is 15,
              which corresponds to a duration of approximately between 13 to
              30 minutes, depending on the retransmission timeout.  The
              RFC 1122 specified minimum limit of 100 seconds is typically
              deemed too short.

The initial retransmission timeout is determined by the measured
round-trip latency for the connection. Subsequent retransmissions
occur at exponentially increasing intervals, capped at 120 seconds.
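
The backoff described above can be modelled with a few lines of
Python. This is a simplified sketch, not the kernel's actual
algorithm: the function name is made up, the 0.2 second initial
timeout is an assumption for a low-latency link, and the only limit
applied is the 120 second cap.

```python
def retransmit_timeout(initial_rto, retries=15, rto_max=120.0):
    """Approximate seconds until an unacknowledged packet is given up on.

    Adds up the wait before each of `retries` retransmissions plus the
    final wait after the last one, doubling each time up to rto_max.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):
        total += min(rto, rto_max)
        rto *= 2
    return total


if __name__ == "__main__":
    secs = retransmit_timeout(0.2)  # assumed initial timeout of 0.2 s
    print(f"{secs:.1f} s, about {secs / 60:.0f} minutes")
```

With the assumed 0.2 second starting point this comes out at roughly
15 minutes, which is in line with the 13 to 30 minute range given in
the man page excerpt; a larger initial timeout pushes the total
towards the upper end.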

If keep-alives aren't being sent, the connection can only time out as
a result of data being sent. If keep-alives are being sent, a timeout
can occur even if the connection is otherwise idle (that's the purpose
of keep-alives).

-- 
Glynn Clements <glynn@gclements.plus.com>


* Re: Preventing SSH timeouts . Some clarification needed
  2010-06-10 23:52                     ` Glynn Clements
@ 2010-06-11  7:22                       ` query
  0 siblings, 0 replies; 16+ messages in thread
From: query @ 2010-06-11  7:22 UTC (permalink / raw)
  To: Glynn Clements; +Cc: linux-admin

Clements, thanks for all the detailed explanations. I think things
are clear to me now. I will try to apply the changes in sshd_config.

And thanks to Michal and all for providing insights on the issue.

Thanks
Zaman


Thread overview: 16+ messages
2010-06-08  9:36 Preventing SSH timeouts . Some clarification needed query
2010-06-08  9:48 ` Michal Nazarewicz
     [not found]   ` <AANLkTimTjSOmbj_ac4iiUMaHRuvp1-ljW-FUGAQbb1qt@mail.gmail.com>
     [not found]     ` <87k4q9g31y.fsf@erwin.mina86.com>
2010-06-08 15:10       ` query
2010-06-08 19:48         ` Michal Nazarewicz
2010-06-09  5:33           ` query
2010-06-08 10:39 ` Glynn Clements
2010-06-08 15:10   ` query
2010-06-08 16:19     ` Glynn Clements
2010-06-09  6:44       ` query
2010-06-09  8:15         ` Adam T. Bowen
2010-06-09 10:14         ` Glynn Clements
     [not found]           ` <AANLkTimDS_IalexVnOKtOuKN8fz13rFumHV8TrjEGtph@mail.gmail.com>
     [not found]             ` <19471.47290.566464.539451@cerise.gclements.plus.com>
2010-06-10  6:02               ` query
2010-06-10 13:03                 ` Glynn Clements
2010-06-10 16:35                   ` query
2010-06-10 23:52                     ` Glynn Clements
2010-06-11  7:22                       ` query
