Per-connection tcp_retries2 and RFC 1122 compliance


* Per-connection tcp_retries2 and RFC 1122 compliance
@ 2015-02-02 21:05 John Eckersberg
  2015-02-03 14:50 ` Neal Cardwell
  0 siblings, 1 reply; 4+ messages in thread
From: John Eckersberg @ 2015-02-02 21:05 UTC
  To: netdev

Greetings,

RFC 1122, section 4.2.3.5 "TCP Connection Failures", states:

  (d)  An application MUST be able to set the value for R2 for
       a particular connection.  For example, an interactive
       application might set R2 to "infinity," giving the user
       control over when to disconnect.

The R2 value referenced above is implemented as the tcp_retries2 sysctl.
However, it seems that the only way to tune that value is via the global
sysctl knob.  In other words, there is no way to set it for just
a particular connection, as RFC 1122 requires.

Could someone confirm that this is a legitimate bug/deficiency?  Or am I
just missing something?  If this is a real bug, I would be willing to
put a patch together to fix it, although I will probably require some
handholding (this would be my first contribution to the kernel).

Thanks,
John
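
[Editor's sketch] To illustrate the "global sysctl knob" mentioned above: on
Linux the value is exposed as /proc/sys/net/ipv4/tcp_retries2. The helper
below is a minimal sketch that only reads it; the function name and the
optional path parameter are illustrative assumptions, not part of any kernel
or library API.

```python
from pathlib import Path

# Standard procfs location of the global tcp_retries2 sysctl on Linux.
TCP_RETRIES2_PATH = "/proc/sys/net/ipv4/tcp_retries2"


def read_tcp_retries2(path=TCP_RETRIES2_PATH):
    """Return the system-wide tcp_retries2 value as an int.

    The value applies to every TCP connection on the host (or network
    namespace); it cannot be scoped to a single socket this way.
    """
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    print("tcp_retries2 =", read_tcp_retries2())
```

Because the setting is host-wide, changing it affects every connection, which
is the limitation raised in this message.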


* Re: Per-connection tcp_retries2 and RFC 1122 compliance
  2015-02-02 21:05 Per-connection tcp_retries2 and RFC 1122 compliance John Eckersberg
@ 2015-02-03 14:50 ` Neal Cardwell
  2015-02-03 18:11   ` John Eckersberg
  0 siblings, 1 reply; 4+ messages in thread
From: Neal Cardwell @ 2015-02-03 14:50 UTC
  To: John Eckersberg; +Cc: Netdev

On Mon, Feb 2, 2015 at 4:05 PM, John Eckersberg <jeckersb@redhat.com> wrote:
> Greetings,
>
> RFC 1122, section 4.2.3.5 "TCP Connection Failures", states:
>
>   (d)  An application MUST be able to set the value for R2 for
>        a particular connection.  For example, an interactive
>        application might set R2 to "infinity," giving the user
>        control over when to disconnect.
>
> The R2 value referenced above is implemented as the tcp_retries2 sysctl.
> However, it seems that the only way to tune that value is via the global
> sysctl knob.  In other words, there is no way to set it for just
> a particular connection, as RFC 1122 requires.
>
> Could someone confirm that this is a legitimate bug/deficiency?  Or am I
> just missing something?

I believe the functionality you are looking for is the
TCP_USER_TIMEOUT socket option:

commit dca43c75e7e545694a9dd6288553f55c53e2a3a3
Author: Jerry Chu <hkchu@google.com>
Date:   Fri Aug 27 19:13:28 2010 +0000

    tcp: Add TCP_USER_TIMEOUT socket option.

    This patch provides "user timeout" support as described in RFC793. The
    socket option is also needed for the local half of RFC5482 "TCP User
    Timeout Option".

    TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
    when > 0, to specify the maximum amount of time in ms that transmitted
    data may remain unacknowledged before TCP will forcefully close the
    corresponding connection and return ETIMEDOUT to the application. If
    0 is given, TCP will continue to use the system default.

    Increasing the user timeouts allows a TCP connection to survive extended
    periods without end-to-end connectivity. Decreasing the user timeouts
    allows applications to "fail fast" if so desired. Otherwise it may take
    up to 20 minutes with the current system defaults in a normal WAN
    environment.
    ....

Note how tcp_write_timeout() can pass in both sysctl_tcp_retries2 and
icsk->icsk_user_timeout to retransmits_timed_out(), and the
icsk->icsk_user_timeout value is used (if non-zero) in preference to
sysctl_tcp_retries2.

neal
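
[Editor's sketch] The per-connection setting described above can be tried from
user space with setsockopt(). This is a minimal sketch, assuming Linux and a
Python build that exposes socket.TCP_USER_TIMEOUT; the fallback constant 18 is
the Linux value from <linux/tcp.h>, and the helper name set_user_timeout is
hypothetical.

```python
import socket

# Linux option number from <linux/tcp.h>; used only if this Python build
# does not export socket.TCP_USER_TIMEOUT.
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def set_user_timeout(sock, timeout_ms):
    """Set TCP_USER_TIMEOUT (milliseconds) on one socket and read it back.

    A value of 0 means "use the system default", as the commit message says.
    """
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, timeout_ms)
    return sock.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # Ask TCP to give up after 30 s of unacknowledged data on this socket.
        print("TCP_USER_TIMEOUT (ms):", set_user_timeout(s, 30_000))
```

The option affects only the socket it is set on, so other connections keep
following the global tcp_retries2 value.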


* Re: Per-connection tcp_retries2 and RFC 1122 compliance
  2015-02-03 14:50 ` Neal Cardwell
@ 2015-02-03 18:11   ` John Eckersberg
  2015-02-03 23:15     ` Willy Tarreau
  0 siblings, 1 reply; 4+ messages in thread
From: John Eckersberg @ 2015-02-03 18:11 UTC
  To: Neal Cardwell; +Cc: Netdev

Neal Cardwell <ncardwell@google.com> writes:
> I believe the functionality you are looking for is the
> TCP_USER_TIMEOUT socket option:

I had tried that previously, and it did not help my case.  The reason
is that I was using a downstream kernel (Fedora 21, 3.17.8 in this
case), which was missing this commit that went into 3.18:

commit b248230c34970a6c1c17c591d63b464e8d2cfc33
Author: Yuchung Cheng <ycheng@google.com>
Date:   Mon Sep 29 13:20:38 2014 -0700

    tcp: abort orphan sockets stalling on zero window probes
    
    Currently we have two different policies for orphan sockets
    that repeatedly stall on zero window ACKs. If a socket gets
    a zero window ACK when it is transmitting data, the RTO is
    used to probe the window. The socket is aborted after roughly
    tcp_orphan_retries() retries (as in tcp_write_timeout()).
    
    But if the socket was idle when it received the zero window ACK,
    and later wants to send more data, we use the probe timer to
    probe the window. If the receiver always returns zero window ACKs,
    icsk_probes keeps getting reset in tcp_ack() and the orphan socket
    can stall forever until the system reaches the orphan limit (as
    commented in tcp_probe_timer()). This opens up a simple attack
    to create lots of hanging orphan sockets to burn the memory
    and the CPU, as demonstrated in the recent netdev post "TCP
    connection will hang in FIN_WAIT1 after closing if zero window is
    advertised." http://www.spinics.net/lists/netdev/msg296539.html
    
    This patch follows the design in RTO-based probe: we abort an orphan
    socket stalling on zero window when the probe timer reaches both
    the maximum backoff and the maximum RTO. For example, a 100ms RTT
    connection will time out after roughly 153 seconds (0.3 + 0.6 +
    ... + 76.8) if the receiver keeps the window shut. If the orphan
    socket passes this check, but the system already has too many orphans
    (as in tcp_out_of_resources()), we still abort it but we'll also
    send an RST packet as the connection may still be active.
    
    In addition, we change TCP_USER_TIMEOUT to cover (live or dead)
    sockets stalled on zero-window probes. This changes the semantics
    of TCP_USER_TIMEOUT slightly because it previously applied only
    when the socket had pending transmission.

The key part is that last paragraph about stalled zero-window
probes.  Here's the specific use case where I'm hitting this:

(1) Establish a TCP connection bound to a given IP address
(2) Remove IP address from host
(3) Write to socket

This gets kicked back by the IP layer as non-routable, which triggers
the same behavior as the zero-window probes.

The good news is, I confirmed this is working as expected when I tested
on 3.19.0-rc7.

Thanks for the pointer, I'll go take my harassment to the relevant
downstream folks.
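
[Editor's sketch] The "roughly 153 seconds" figure in the commit message above
is the sum of a doubling series of probe intervals. The sketch below just
reproduces that arithmetic; the first and last intervals (0.3 s and 76.8 s)
are taken from the commit message, the function name is hypothetical, and how
the kernel derives those bounds from the RTT is not modelled.

```python
def probe_backoff_total_ms(first_interval_ms, last_interval_ms):
    """Sum a doubling series of intervals from first to last, inclusive.

    Integer milliseconds are used to avoid floating-point rounding.
    """
    total = 0
    interval = first_interval_ms
    while interval <= last_interval_ms:
        total += interval
        interval *= 2  # exponential backoff: each probe waits twice as long
    return total


if __name__ == "__main__":
    total_ms = probe_backoff_total_ms(300, 76_800)
    print(total_ms / 1000, "seconds")
```

Equivalently, the series has nine terms, so the total is 0.3 * (2^9 - 1) =
153.3 seconds.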


* Re: Per-connection tcp_retries2 and RFC 1122 compliance
  2015-02-03 18:11   ` John Eckersberg
@ 2015-02-03 23:15     ` Willy Tarreau
  0 siblings, 0 replies; 4+ messages in thread
From: Willy Tarreau @ 2015-02-03 23:15 UTC
  To: David Miller; +Cc: John Eckersberg, Neal Cardwell, Yuchung Cheng, Netdev

Hi David,

Do you think we could have the fix below queued for -stable?  It
appears to fix some quite annoying issues that are not easy to
debug.

Thanks,
Willy

On Tue, Feb 03, 2015 at 01:11:46PM -0500, John Eckersberg wrote:
> Neal Cardwell <ncardwell@google.com> writes:
> > I believe the functionality you are looking for is the
> > TCP_USER_TIMEOUT socket option:
> 
> I had tried that previously, and it did not help my case.  The reason
> why is that I was using a downstream kernel (Fedora 21, 3.17.8 in this
> case) and it was missing this commit that went into 3.18:
> 
> commit b248230c34970a6c1c17c591d63b464e8d2cfc33
> Author: Yuchung Cheng <ycheng@google.com>
> Date:   Mon Sep 29 13:20:38 2014 -0700
> 
>     tcp: abort orphan sockets stalling on zero window probes
> [...]
