public inbox for netdev@vger.kernel.org
 help / color / mirror / Atom feed
* TCP default settings (bugzilla)
@ 2026-04-15 14:14 Stephen Hemminger
  0 siblings, 0 replies; 4+ messages in thread
From: Stephen Hemminger @ 2026-04-15 14:14 UTC (permalink / raw)
  To: netdev

A pair of TCP configuration related bug reports just showed up in bugzilla.
Getting the right time values here seems like a trade-off between fast
failover and not dropping crappy connections.

Given how well formatted the bugs are, they look AI-generated.

https://bugzilla.kernel.org/show_bug.cgi?id=221366

The default value of net.ipv4.tcp_retries2 (15 retries, resulting in
~924 seconds / ~15.4 minutes before TCP abandons a dead connection) is
far too high for modern data center environments. When a remote host
becomes unreachable (server crash, failover, network partition),
applications are stuck for up to 16 minutes before receiving an error
and taking recovery action. This causes cascading failures, connection
pool exhaustion, and prolonged service outages.
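
For reference, the ~924 s figure follows from the retransmission backoff
model in the kernel docs: an RTO starting at a hypothetical 200 ms minimum,
doubling on each retransmission, capped at 120 s. A quick sketch, assuming
those documented constants:

```python
# Hypothetical lower-bound timeout behind net.ipv4.tcp_retries2, per the
# documented model: RTO starts at 200 ms, doubles each retransmission,
# and is capped at TCP_RTO_MAX (120 s).

RTO_INIT = 0.2   # seconds, hypothetical initial RTO
RTO_MAX = 120.0  # seconds

def hypothetical_timeout(retries2):
    """Lower bound in seconds before the connection is abandoned."""
    # The boundary is checked before the next retransmission, so the
    # sum runs over retries2 + 1 backoff intervals.
    return sum(min(RTO_INIT * 2**k, RTO_MAX) for k in range(retries2 + 1))

print(round(hypothetical_timeout(15), 1))  # 924.6 (~15.4 minutes)
print(round(hypothetical_timeout(7), 1))   # 51.0
```

The effective timeout is the first RTO expiry past this boundary, so real
abort times land somewhat above these sums.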

https://bugzilla.kernel.org/show_bug.cgi?id=221365

The default value of net.ipv4.tcp_keepalive_time (7200 seconds / 2
hours) is incompatible with virtually all modern network
infrastructure, causing silent connection failures. Intermediate
stateful devices (load balancers, firewalls, NAT gateways) routinely
expire idle TCP connections after 300-1800 seconds — long before the
first keepalive probe is ever sent.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: TCP default settings (bugzilla)
@ 2026-04-17  5:58 plantegg ren
  0 siblings, 0 replies; 4+ messages in thread
From: plantegg ren @ 2026-04-17  5:58 UTC (permalink / raw)
  To: stephen; +Cc: netdev

Hi Stephen,

  I'm the reporter of those two bugs. I'm a DBA and Linux SRE with over
  10 years at Alibaba Cloud (Aliyun).

  These come from real production pain, not just theory. During my time at
  Alibaba Cloud, I pushed to change the default tcp_retries2 from 15 to 7
  in Alibaba Cloud Linux 3 (ALinux3) — our in-house distro serving millions
  of ECS instances. That change alone eliminated a whole class of prolonged
  outages across the fleet.

  The most memorable case: MySQL crashed and restarted in seconds, but the
  application tier stayed down for ~16 minutes because all existing
  connections were stuck in retransmission. After changing tcp_retries2 from
  15 to 5, recovery time dropped from 957s to about 20s.
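
  (A per-socket alternative worth noting: TCP_USER_TIMEOUT caps how long
  transmitted data may sit unacknowledged before the kernel aborts the
  connection, independent of the global tcp_retries2. A minimal Linux
  sketch; the 20 s cap is illustrative, echoing the recovery time above:)

```python
import socket

# TCP_USER_TIMEOUT (Linux 2.6.37+): abort the connection once transmitted
# data has been unacknowledged for this many milliseconds, regardless of
# how many retransmissions tcp_retries2 would still permit.
TIMEOUT_MS = 20_000  # illustrative 20 s cap

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, TIMEOUT_MS)
val = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT)  # read back
s.close()
```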

  The tcp_keepalive_time issue bit us through LVS — connections were
  silently dropped after 900s of idle time, but TCP didn't notice until
  the first keepalive probe at the 7200s mark.
  We spent days chasing "random" Connection Reset errors across dozens of
  services before tracing it to this mismatch.
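
  (For completeness: an application that owns its sockets can also shorten
  the keepalive schedule without touching the sysctl. A minimal Linux
  sketch; the 60/10/5 values are illustrative, not a recommendation:)

```python
import socket

# Enable keepalive on a single socket and override the 7200 s system
# default for this connection only (Linux-specific option names).
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # first probe after 60 s idle
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # then probe every 10 s
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # drop after 5 unanswered probes

idle = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE)  # read back the override
s.close()
```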

  Every ops team I've talked to ends up applying these tweaks independently
  after getting burned. If a major cloud distro already ships tcp_retries2=7,
  maybe it's time for upstream to reconsider the default too.

  I did use AI to help format the bug reports (guilty as charged), but the
  problems and the data are from years of production experience.

  Thanks for forwarding to the list.

  Xijun Ren

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: TCP default settings (bugzilla)
@ 2026-04-17  7:01 plantegg ren
  2026-04-17  7:33 ` Willy Tarreau
  0 siblings, 1 reply; 4+ messages in thread
From: plantegg ren @ 2026-04-17  7:01 UTC (permalink / raw)
  To: stephen; +Cc: netdev

Hi,

One more real-world data point that just happened two weeks ago,
directly related to tcp_keepalive_time.

AWS recently rolled out Nitro V6 (8th-gen EC2 instances) which reduced
the ENI connection tracking timeout from 432000 seconds (5 days) to
just 350 seconds:

  https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-connection-tracking.html

Our MySQL/HikariCP connection pools started seeing intermittent timeout
errors every 20-30 minutes after migrating to 8th-gen instances. We
captured packets on both client and server simultaneously. Here is what
we found on a single connection (idle for 818 seconds, well past the
350-second ENI timeout):

Server side -- MySQL receives the request and sends responses normally:

  #270  71.51s  10.23.99.71 -> 172.20.64.240  [ACK]     last activity
                  ~~~ connection idle for 818 seconds ~~~
  #271 889.94s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5   client request arrives
  #272 889.94s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server responds OK
  #275 890.15s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server retransmits
  #278 890.59s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server retransmits
  #281 891.02s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server retransmits
    ... (server keeps retransmitting, client never ACKs)

Client side -- sends request, but NEVER receives any server response:

  #267  71.51s  10.23.99.71 -> 172.20.64.240  [ACK]     last activity
                  ~~~ connection idle for 818 seconds ~~~
  #268 889.94s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  sends request
  #269 890.15s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 1
  #270 890.37s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 2
  #271 890.79s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 3
  #272 891.65s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 4
  #273 893.38s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 5
  #274 894.94s  10.23.99.71 -> 172.20.64.240  [FIN,ACK]         gives up

  Zero packets from 172.20.64.240 after the idle gap. Zero RSTs.

The ENI silently drops all inbound packets (server -> client) because
the connection tracking entry expired after 350 seconds. Outbound
packets (client -> server) still pass through, so the server receives
the request and responds -- but its responses are black-holed by the
ENI. No RST is sent, so both sides are completely unaware.

If tcp_keepalive_time were lower than 350 seconds, the keepalive probes
would have kept the ENI tracking entry alive, and none of this would
have happened.

The trend is clear -- middlebox idle timeouts are getting shorter (AWS
went from 432000s to 350s overnight), while tcp_keepalive_time has
stayed at 7200 seconds for decades. The gap is widening.

Xijun

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: TCP default settings (bugzilla)
  2026-04-17  7:01 TCP default settings (bugzilla) plantegg ren
@ 2026-04-17  7:33 ` Willy Tarreau
  0 siblings, 0 replies; 4+ messages in thread
From: Willy Tarreau @ 2026-04-17  7:33 UTC (permalink / raw)
  To: plantegg ren; +Cc: stephen, netdev

On Fri, Apr 17, 2026 at 03:01:08PM +0800, plantegg ren wrote:
> Hi,
> 
> One more real-world data point that just happened two weeks ago,
> directly related to tcp_keepalive_time.
> 
> AWS recently rolled out Nitro V6 (8th-gen EC2 instances) which reduced
> the ENI connection tracking timeout from 432000 seconds (5 days) to
> just 350 seconds:
> 
>   https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-connection-tracking.html
> 
> Our MySQL/HikariCP connection pools started seeing intermittent timeout
> errors every 20-30 minutes after migrating to 8th-gen instances. We
> captured packets on both client and server simultaneously. Here is what
> we found on a single connection (idle for 818 seconds, well past the
> 350-second ENI timeout):
> 
> Server side -- MySQL receives the request and sends responses normally:
> 
>   #270  71.51s  10.23.99.71 -> 172.20.64.240  [ACK]     last activity
>                   ~~~ connection idle for 818 seconds ~~~
>   #271 889.94s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5   client request arrives
>   #272 889.94s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server responds OK
>   #275 890.15s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server retransmits
>   #278 890.59s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server retransmits
>   #281 891.02s  172.20.64.240 -> 10.23.99.71  [PSH,ACK] len=11  server retransmits
>     ... (server keeps retransmitting, client never ACKs)
> 
> Client side -- sends request, but NEVER receives any server response:
> 
>   #267  71.51s  10.23.99.71 -> 172.20.64.240  [ACK]     last activity
>                   ~~~ connection idle for 818 seconds ~~~
>   #268 889.94s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  sends request
>   #269 890.15s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 1
>   #270 890.37s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 2
>   #271 890.79s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 3
>   #272 891.65s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 4
>   #273 893.38s  10.23.99.71 -> 172.20.64.240  [PSH,ACK] len=5  retransmit 5
>   #274 894.94s  10.23.99.71 -> 172.20.64.240  [FIN,ACK]         gives up
> 
>   Zero packets from 172.20.64.240 after the idle gap. Zero RSTs.
> 
> The ENI silently drops all inbound packets (server -> client) because
> the connection tracking entry expired after 350 seconds. Outbound
> packets (client -> server) still pass through, so the server receives
> the request and responds -- but its responses are black-holed by the
> ENI. No RST is sent, so both sides are completely unaware.
> 
> If tcp_keepalive_time were lower than 350 seconds, the keepalive probes
> would have kept the ENI tracking entry alive, and none of this would
> have happened.
> 
> The trend is clear -- middlebox idle timeouts are getting shorter (AWS
> went from 432000s to 350s overnight), while tcp_keepalive_time has
> stayed at 7200 seconds for decades. The gap is widening.

It's up to the application to configure the keepalive interval if it
is relying on long connections, it's done using TCP_KEEPINTVL, and if
you're dealing with an application that doesn't expose the setting,
you indeed still have access to the system-wide setting above.

It's been well-known for at least two decades that no middle box could
sanely keep idle connections forever with the amount of traffic they're
seeing. 25 years ago I was already tuning the conntrack timeouts for a
bank firewall that was dealing with only 6k connections per second so
as to stay within reasonable memory sizes while keeping a good quality
of service. There's nothing new here.

Willy

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2026-04-17  7:33 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-17  7:01 TCP default settings (bugzilla) plantegg ren
2026-04-17  7:33 ` Willy Tarreau
  -- strict thread matches above, loose matches on Subject: below --
2026-04-17  5:58 plantegg ren
2026-04-15 14:14 Stephen Hemminger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox