linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1
@ 2025-08-22  6:02 'MingMing Wang'
  2025-08-22  8:53 ` Eric Dumazet
  0 siblings, 1 reply; 4+ messages in thread
From: 'MingMing Wang' @ 2025-08-22  6:02 UTC (permalink / raw)
  To: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
	ycheng
  Cc: netdev, linux-kernel, MingMing Wang, Dust Li

From: MingMing Wang <mii.w@linux.alibaba.com>

An orphaned TCP socket can stall indefinitely in FIN-WAIT-1
if the following conditions are met:
1. net.ipv4.tcp_retries2 is set to a value ≤ 8;
2. The peer advertises a zero window, and the window never reopens.

Steps to reproduce:
1. Set up two instances with nmap installed: one will act as the server
   the other as the client
2. Execute on the server:
   a. lower rmem : `sysctl -w net.ipv4.tcp_rmem="16 32 32"`
   b. start a listener: `nc -l -p 1234`
3. Execute on the client:
   a. lower tcp_retries2: `sysctl -w net.ipv4.tcp_retries2=8`
   b. send pakcets: `cat /dev/zero | nc <server-ip> 1234`
   c. after five seconds, stop the process: `killall nc`
4. Execute on the server: `killall -STOP nc`
5. Expected abnormal result: using `ss` command, we'll notice that the
   client connection remains stuck in the FIN_WAIT1 state, and the
   backoff counter always be 8 and no longer increased, as shown below:
   ```
   FIN-WAIT-1 0      1389    172.16.0.2:50316    172.16.0.1:1234
         cubic wscale:2,7 rto:201 backoff:8 rtt:0.078/0.007 mss:36
		 ... other fields omitted ...
   ```
6. If we set tcp_retries2 to 15 and repeat the steps above, the FIN_WAIT1
   state will be forcefully reclaimed after about 5 minutes.

During the zero-window probe retry process, it will check whether the
current connection is alive or not. If the connection is not alive and
the counter of retries exceeds the maximum allowed `max_probes`, retry
process will be terminated.

In our case, when we set `net.ipv4.tcp_retries2` to 8 or a less value,
according to the current implementation, the `icsk->icsk_backoff` counter
will be capped at `net.ipv4.tcp_retries2`. The value calculated by
`inet_csk_rto_backoff` will always be too small, which means the
computed backoff duration will always be less than rto_max. As a result,
the alive check will always return true. The condition before the
`goto abort` statement is an logical AND condition, the abort branch
can never be reached.

So, the TCP retransmission backoff mechanism has two issues:

1. `icsk->icsk_backoff` should monotonically increase during probe
   transmission and, upon reaching the maximum backoff limit, the
   connection should be terminated. However, the backoff value itself
   must not be capped prematurely — it should only control when to abort.

2. The condition for orphaned connection abort was incorrectly based on
   connection liveness and probe count. It should instead consider whether
   the number of orphaned probes exceeds the intended limit.

To fix this, introduce a local variable `orphan_probes` to track orphan
probe attempts separately from `max_probes`, which is used for RTO
retransmissions. This decouples the two counters and prevents accidental
overwrites, ensuring correct timeout behavior for orphaned connections.

Fixes: b248230c34970 ("tcp: abort orphan sockets stalling on zero window probes")
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Co-developed-by: MingMing Wang <mii.w@linux.alibaba.com>
Signed-off-by: MingMing Wang <mii.w@linux.alibaba.com>

---
We couldn't determine the rationale behind the following check in tcp_send_probe0():
```
if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
    icsk->icsk_backoff++;
```

This condition appears to be the root cause of the observed stall.
However, it has existed in the kernel for over 20 years — which suggests
there might be a historical or subtle reason for its presence.

We would greatly appreciate it if anyone could shed
---
 net/ipv4/tcp_output.c | 4 +---
 net/ipv4/tcp_timer.c  | 4 ++--
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index caf11920a878..21795d696e38 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4385,7 +4385,6 @@ void tcp_send_probe0(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct net *net = sock_net(sk);
 	unsigned long timeout;
 	int err;
 
@@ -4401,8 +4400,7 @@ void tcp_send_probe0(struct sock *sk)
 
 	icsk->icsk_probes_out++;
 	if (err <= 0) {
-		if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
-			icsk->icsk_backoff++;
+		icsk->icsk_backoff++;
 		timeout = tcp_probe0_when(sk, tcp_rto_max(sk));
 	} else {
 		/* If packet was not sent due to local congestion,
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index a207877270fb..4dba2928e1bf 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -419,9 +419,9 @@ static void tcp_probe_timer(struct sock *sk)
 	if (sock_flag(sk, SOCK_DEAD)) {
 		unsigned int rto_max = tcp_rto_max(sk);
 		const bool alive = inet_csk_rto_backoff(icsk, rto_max) < rto_max;
+		int orphan_probes = tcp_orphan_retries(sk, alive);
 
-		max_probes = tcp_orphan_retries(sk, alive);
-		if (!alive && icsk->icsk_backoff >= max_probes)
+		if (!alive || icsk->icsk_backoff >= orphan_probes)
 			goto abort;
 		if (tcp_out_of_resources(sk, true))
 			return;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-08-25  6:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-22  6:02 [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1 'MingMing Wang'
2025-08-22  8:53 ` Eric Dumazet
2025-08-22 16:02   ` Yuchung Cheng
2025-08-25  6:11   ` MingMing Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).