linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1
@ 2025-08-22  6:02 'MingMing Wang'
  2025-08-22  8:53 ` Eric Dumazet
  0 siblings, 1 reply; 4+ messages in thread
From: 'MingMing Wang' @ 2025-08-22  6:02 UTC (permalink / raw)
  To: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
	ycheng
  Cc: netdev, linux-kernel, MingMing Wang, Dust Li

From: MingMing Wang <mii.w@linux.alibaba.com>

An orphaned TCP socket can stall indefinitely in FIN-WAIT-1
if the following conditions are met:
1. net.ipv4.tcp_retries2 is set to a value ≤ 8;
2. The peer advertises a zero window, and the window never reopens.

Steps to reproduce:
1. Set up two instances with nmap installed: one will act as the server
   the other as the client
2. Execute on the server:
   a. lower rmem : `sysctl -w net.ipv4.tcp_rmem="16 32 32"`
   b. start a listener: `nc -l -p 1234`
3. Execute on the client:
   a. lower tcp_retries2: `sysctl -w net.ipv4.tcp_retries2=8`
   b. send pakcets: `cat /dev/zero | nc <server-ip> 1234`
   c. after five seconds, stop the process: `killall nc`
4. Execute on the server: `killall -STOP nc`
5. Expected abnormal result: using `ss` command, we'll notice that the
   client connection remains stuck in the FIN_WAIT1 state, and the
   backoff counter always be 8 and no longer increased, as shown below:
   ```
   FIN-WAIT-1 0      1389    172.16.0.2:50316    172.16.0.1:1234
         cubic wscale:2,7 rto:201 backoff:8 rtt:0.078/0.007 mss:36
		 ... other fields omitted ...
   ```
6. If we set tcp_retries2 to 15 and repeat the steps above, the FIN_WAIT1
   state will be forcefully reclaimed after about 5 minutes.

During the zero-window probe retry process, it will check whether the
current connection is alive or not. If the connection is not alive and
the counter of retries exceeds the maximum allowed `max_probes`, retry
process will be terminated.

In our case, when we set `net.ipv4.tcp_retries2` to 8 or a less value,
according to the current implementation, the `icsk->icsk_backoff` counter
will be capped at `net.ipv4.tcp_retries2`. The value calculated by
`inet_csk_rto_backoff` will always be too small, which means the
computed backoff duration will always be less than rto_max. As a result,
the alive check will always return true. The condition before the
`goto abort` statement is an logical AND condition, the abort branch
can never be reached.

So, the TCP retransmission backoff mechanism has two issues:

1. `icsk->icsk_backoff` should monotonically increase during probe
   transmission and, upon reaching the maximum backoff limit, the
   connection should be terminated. However, the backoff value itself
   must not be capped prematurely — it should only control when to abort.

2. The condition for orphaned connection abort was incorrectly based on
   connection liveness and probe count. It should instead consider whether
   the number of orphaned probes exceeds the intended limit.

To fix this, introduce a local variable `orphan_probes` to track orphan
probe attempts separately from `max_probes`, which is used for RTO
retransmissions. This decouples the two counters and prevents accidental
overwrites, ensuring correct timeout behavior for orphaned connections.

Fixes: b248230c34970 ("tcp: abort orphan sockets stalling on zero window probes")
Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Co-developed-by: MingMing Wang <mii.w@linux.alibaba.com>
Signed-off-by: MingMing Wang <mii.w@linux.alibaba.com>

---
We couldn't determine the rationale behind the following check in tcp_send_probe0():
```
if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
    icsk->icsk_backoff++;
```

This condition appears to be the root cause of the observed stall.
However, it has existed in the kernel for over 20 years — which suggests
there might be a historical or subtle reason for its presence.

We would greatly appreciate it if anyone could shed
---
 net/ipv4/tcp_output.c | 4 +---
 net/ipv4/tcp_timer.c  | 4 ++--
 2 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index caf11920a878..21795d696e38 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -4385,7 +4385,6 @@ void tcp_send_probe0(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct net *net = sock_net(sk);
 	unsigned long timeout;
 	int err;
 
@@ -4401,8 +4400,7 @@ void tcp_send_probe0(struct sock *sk)
 
 	icsk->icsk_probes_out++;
 	if (err <= 0) {
-		if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
-			icsk->icsk_backoff++;
+		icsk->icsk_backoff++;
 		timeout = tcp_probe0_when(sk, tcp_rto_max(sk));
 	} else {
 		/* If packet was not sent due to local congestion,
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index a207877270fb..4dba2928e1bf 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -419,9 +419,9 @@ static void tcp_probe_timer(struct sock *sk)
 	if (sock_flag(sk, SOCK_DEAD)) {
 		unsigned int rto_max = tcp_rto_max(sk);
 		const bool alive = inet_csk_rto_backoff(icsk, rto_max) < rto_max;
+		int orphan_probes = tcp_orphan_retries(sk, alive);
 
-		max_probes = tcp_orphan_retries(sk, alive);
-		if (!alive && icsk->icsk_backoff >= max_probes)
+		if (!alive || icsk->icsk_backoff >= orphan_probes)
 			goto abort;
 		if (tcp_out_of_resources(sk, true))
 			return;
-- 
2.46.0


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1
  2025-08-22  6:02 [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1 'MingMing Wang'
@ 2025-08-22  8:53 ` Eric Dumazet
  2025-08-22 16:02   ` Yuchung Cheng
  2025-08-25  6:11   ` MingMing Wang
  0 siblings, 2 replies; 4+ messages in thread
From: Eric Dumazet @ 2025-08-22  8:53 UTC (permalink / raw)
  To: MingMing Wang
  Cc: ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms, ycheng,
	netdev, linux-kernel, Dust Li

On Thu, Aug 21, 2025 at 11:04 PM MingMing Wang <mii.w@linux.alibaba.com> wrote:
>
> From: MingMing Wang <mii.w@linux.alibaba.com>
>
> An orphaned TCP socket can stall indefinitely in FIN-WAIT-1
> if the following conditions are met:
> 1. net.ipv4.tcp_retries2 is set to a value ≤ 8;
> 2. The peer advertises a zero window, and the window never reopens.
>
> Steps to reproduce:
> 1. Set up two instances with nmap installed: one will act as the server
>    the other as the client
> 2. Execute on the server:
>    a. lower rmem : `sysctl -w net.ipv4.tcp_rmem="16 32 32"`
>    b. start a listener: `nc -l -p 1234`
> 3. Execute on the client:
>    a. lower tcp_retries2: `sysctl -w net.ipv4.tcp_retries2=8`
>    b. send pakcets: `cat /dev/zero | nc <server-ip> 1234`
>    c. after five seconds, stop the process: `killall nc`
> 4. Execute on the server: `killall -STOP nc`
> 5. Expected abnormal result: using `ss` command, we'll notice that the
>    client connection remains stuck in the FIN_WAIT1 state, and the
>    backoff counter always be 8 and no longer increased, as shown below:
>    ```

Hi MingMing

Please prepare and share with us a packetdrill test, instead of this
'repro', which is the old way of describing things :/

- This will be easier for us to understand the issue.

- It will be added to existing tests in tools/testing/selftests/net/packetdrill
if your patch is accepted, so that we can make sure future changes are
not breaking this again.

Ideally, you should attach this packetdrill test in a second patch
(thus sending a series of two patches)

Thank you.

>    FIN-WAIT-1 0      1389    172.16.0.2:50316    172.16.0.1:1234
>          cubic wscale:2,7 rto:201 backoff:8 rtt:0.078/0.007 mss:36
>                  ... other fields omitted ...
>    ```
> 6. If we set tcp_retries2 to 15 and repeat the steps above, the FIN_WAIT1
>    state will be forcefully reclaimed after about 5 minutes.
>
> During the zero-window probe retry process, it will check whether the
> current connection is alive or not. If the connection is not alive and
> the counter of retries exceeds the maximum allowed `max_probes`, retry
> process will be terminated.
>
> In our case, when we set `net.ipv4.tcp_retries2` to 8 or a less value,
> according to the current implementation, the `icsk->icsk_backoff` counter
> will be capped at `net.ipv4.tcp_retries2`. The value calculated by
> `inet_csk_rto_backoff` will always be too small, which means the
> computed backoff duration will always be less than rto_max. As a result,
> the alive check will always return true. The condition before the
> `goto abort` statement is an logical AND condition, the abort branch
> can never be reached.
>
> So, the TCP retransmission backoff mechanism has two issues:
>
> 1. `icsk->icsk_backoff` should monotonically increase during probe
>    transmission and, upon reaching the maximum backoff limit, the
>    connection should be terminated. However, the backoff value itself
>    must not be capped prematurely — it should only control when to abort.
>
> 2. The condition for orphaned connection abort was incorrectly based on
>    connection liveness and probe count. It should instead consider whether
>    the number of orphaned probes exceeds the intended limit.
>
> To fix this, introduce a local variable `orphan_probes` to track orphan
> probe attempts separately from `max_probes`, which is used for RTO
> retransmissions. This decouples the two counters and prevents accidental
> overwrites, ensuring correct timeout behavior for orphaned connections.
>
> Fixes: b248230c34970 ("tcp: abort orphan sockets stalling on zero window probes")
> Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
> Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
> Co-developed-by: MingMing Wang <mii.w@linux.alibaba.com>
> Signed-off-by: MingMing Wang <mii.w@linux.alibaba.com>
>
> ---
> We couldn't determine the rationale behind the following check in tcp_send_probe0():
> ```
> if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
>     icsk->icsk_backoff++;
> ```
>
> This condition appears to be the root cause of the observed stall.
> However, it has existed in the kernel for over 20 years — which suggests
> there might be a historical or subtle reason for its presence.
>
> We would greatly appreciate it if anyone could shed
> ---
>  net/ipv4/tcp_output.c | 4 +---
>  net/ipv4/tcp_timer.c  | 4 ++--
>  2 files changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index caf11920a878..21795d696e38 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -4385,7 +4385,6 @@ void tcp_send_probe0(struct sock *sk)
>  {
>         struct inet_connection_sock *icsk = inet_csk(sk);
>         struct tcp_sock *tp = tcp_sk(sk);
> -       struct net *net = sock_net(sk);
>         unsigned long timeout;
>         int err;
>
> @@ -4401,8 +4400,7 @@ void tcp_send_probe0(struct sock *sk)
>
>         icsk->icsk_probes_out++;
>         if (err <= 0) {
> -               if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
> -                       icsk->icsk_backoff++;
> +               icsk->icsk_backoff++;

I think we need to have a cap, otherwise we risk overflows in
inet_csk_rto_backoff()


>                 timeout = tcp_probe0_when(sk, tcp_rto_max(sk));
>         } else {
>                 /* If packet was not sent due to local congestion,
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index a207877270fb..4dba2928e1bf 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -419,9 +419,9 @@ static void tcp_probe_timer(struct sock *sk)
>         if (sock_flag(sk, SOCK_DEAD)) {
>                 unsigned int rto_max = tcp_rto_max(sk);
>                 const bool alive = inet_csk_rto_backoff(icsk, rto_max) < rto_max;
> +               int orphan_probes = tcp_orphan_retries(sk, alive);
>
> -               max_probes = tcp_orphan_retries(sk, alive);
> -               if (!alive && icsk->icsk_backoff >= max_probes)
> +               if (!alive || icsk->icsk_backoff >= orphan_probes)
>                         goto abort;
>                 if (tcp_out_of_resources(sk, true))
>                         return;
> --
> 2.46.0
>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1
  2025-08-22  8:53 ` Eric Dumazet
@ 2025-08-22 16:02   ` Yuchung Cheng
  2025-08-25  6:11   ` MingMing Wang
  1 sibling, 0 replies; 4+ messages in thread
From: Yuchung Cheng @ 2025-08-22 16:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: MingMing Wang, ncardwell, kuniyu, davem, dsahern, kuba, pabeni,
	horms, netdev, linux-kernel, Dust Li

On Fri, Aug 22, 2025 at 1:53 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Aug 21, 2025 at 11:04 PM MingMing Wang <mii.w@linux.alibaba.com> wrote:
> >
> > From: MingMing Wang <mii.w@linux.alibaba.com>
> >
> > An orphaned TCP socket can stall indefinitely in FIN-WAIT-1
> > if the following conditions are met:
> > 1. net.ipv4.tcp_retries2 is set to a value ≤ 8;
> > 2. The peer advertises a zero window, and the window never reopens.
> >
> > Steps to reproduce:
> > 1. Set up two instances with nmap installed: one will act as the server
> >    the other as the client
> > 2. Execute on the server:
> >    a. lower rmem : `sysctl -w net.ipv4.tcp_rmem="16 32 32"`
> >    b. start a listener: `nc -l -p 1234`
> > 3. Execute on the client:
> >    a. lower tcp_retries2: `sysctl -w net.ipv4.tcp_retries2=8`
> >    b. send pakcets: `cat /dev/zero | nc <server-ip> 1234`
> >    c. after five seconds, stop the process: `killall nc`
> > 4. Execute on the server: `killall -STOP nc`
> > 5. Expected abnormal result: using `ss` command, we'll notice that the
> >    client connection remains stuck in the FIN_WAIT1 state, and the
> >    backoff counter always be 8 and no longer increased, as shown below:
> >    ```
>
> Hi MingMing
>
> Please prepare and share with us a packetdrill test, instead of this
> 'repro', which is the old way of describing things :/
>
> - This will be easier for us to understand the issue.
>
> - It will be added to existing tests in tools/testing/selftests/net/packetdrill
> if your patch is accepted, so that we can make sure future changes are
> not breaking this again.
>
> Ideally, you should attach this packetdrill test in a second patch
> (thus sending a series of two patches)
>
> Thank you.
>
> >    FIN-WAIT-1 0      1389    172.16.0.2:50316    172.16.0.1:1234
> >          cubic wscale:2,7 rto:201 backoff:8 rtt:0.078/0.007 mss:36
> >                  ... other fields omitted ...
> >    ```
> > 6. If we set tcp_retries2 to 15 and repeat the steps above, the FIN_WAIT1
> >    state will be forcefully reclaimed after about 5 minutes.
> >
> > During the zero-window probe retry process, it will check whether the
> > current connection is alive or not. If the connection is not alive and
> > the counter of retries exceeds the maximum allowed `max_probes`, retry
> > process will be terminated.
> >
> > In our case, when we set `net.ipv4.tcp_retries2` to 8 or a less value,
> > according to the current implementation, the `icsk->icsk_backoff` counter
> > will be capped at `net.ipv4.tcp_retries2`. The value calculated by
> > `inet_csk_rto_backoff` will always be too small, which means the
> > computed backoff duration will always be less than rto_max. As a result,
> > the alive check will always return true. The condition before the
> > `goto abort` statement is an logical AND condition, the abort branch
> > can never be reached.
> >
> > So, the TCP retransmission backoff mechanism has two issues:
> >
> > 1. `icsk->icsk_backoff` should monotonically increase during probe
> >    transmission and, upon reaching the maximum backoff limit, the
> >    connection should be terminated. However, the backoff value itself
> >    must not be capped prematurely — it should only control when to abort.
> >
> > 2. The condition for orphaned connection abort was incorrectly based on
> >    connection liveness and probe count. It should instead consider whether
> >    the number of orphaned probes exceeds the intended limit.
> >
> > To fix this, introduce a local variable `orphan_probes` to track orphan
> > probe attempts separately from `max_probes`, which is used for RTO
> > retransmissions. This decouples the two counters and prevents accidental
> > overwrites, ensuring correct timeout behavior for orphaned connections.
> >
> > Fixes: b248230c34970 ("tcp: abort orphan sockets stalling on zero window probes")
Thanks for catching this corner case. Feel free to add a Acked-by:
<ycheng@google.com> after the packetdrill test in your respin

> > Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
> > Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
> > Co-developed-by: MingMing Wang <mii.w@linux.alibaba.com>
> > Signed-off-by: MingMing Wang <mii.w@linux.alibaba.com>
> >
> > ---
> > We couldn't determine the rationale behind the following check in tcp_send_probe0():
> > ```
> > if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
> >     icsk->icsk_backoff++;
> > ```
> >
> > This condition appears to be the root cause of the observed stall.
> > However, it has existed in the kernel for over 20 years — which suggests
> > there might be a historical or subtle reason for its presence.
> >
> > We would greatly appreciate it if anyone could shed
> > ---
> >  net/ipv4/tcp_output.c | 4 +---
> >  net/ipv4/tcp_timer.c  | 4 ++--
> >  2 files changed, 3 insertions(+), 5 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index caf11920a878..21795d696e38 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -4385,7 +4385,6 @@ void tcp_send_probe0(struct sock *sk)
> >  {
> >         struct inet_connection_sock *icsk = inet_csk(sk);
> >         struct tcp_sock *tp = tcp_sk(sk);
> > -       struct net *net = sock_net(sk);
> >         unsigned long timeout;
> >         int err;
> >
> > @@ -4401,8 +4400,7 @@ void tcp_send_probe0(struct sock *sk)
> >
> >         icsk->icsk_probes_out++;
> >         if (err <= 0) {
> > -               if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
> > -                       icsk->icsk_backoff++;
> > +               icsk->icsk_backoff++;
>
> I think we need to have a cap, otherwise we risk overflows in
> inet_csk_rto_backoff()
>
>
> >                 timeout = tcp_probe0_when(sk, tcp_rto_max(sk));
> >         } else {
> >                 /* If packet was not sent due to local congestion,
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index a207877270fb..4dba2928e1bf 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -419,9 +419,9 @@ static void tcp_probe_timer(struct sock *sk)
> >         if (sock_flag(sk, SOCK_DEAD)) {
> >                 unsigned int rto_max = tcp_rto_max(sk);
> >                 const bool alive = inet_csk_rto_backoff(icsk, rto_max) < rto_max;
> > +               int orphan_probes = tcp_orphan_retries(sk, alive);
> >
> > -               max_probes = tcp_orphan_retries(sk, alive);
> > -               if (!alive && icsk->icsk_backoff >= max_probes)
> > +               if (!alive || icsk->icsk_backoff >= orphan_probes)
> >                         goto abort;
> >                 if (tcp_out_of_resources(sk, true))
> >                         return;
> > --
> > 2.46.0
> >

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1
  2025-08-22  8:53 ` Eric Dumazet
  2025-08-22 16:02   ` Yuchung Cheng
@ 2025-08-25  6:11   ` MingMing Wang
  1 sibling, 0 replies; 4+ messages in thread
From: MingMing Wang @ 2025-08-25  6:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms, ycheng,
	netdev, linux-kernel, Dust Li



On 2025/8/22 16:53, Eric Dumazet wrote:
> On Thu, Aug 21, 2025 at 11:04 PM MingMing Wang <mii.w@linux.alibaba.com> wrote:
>>
>> From: MingMing Wang <mii.w@linux.alibaba.com>
>>
>> An orphaned TCP socket can stall indefinitely in FIN-WAIT-1
>> if the following conditions are met:
>> 1. net.ipv4.tcp_retries2 is set to a value ≤ 8;
>> 2. The peer advertises a zero window, and the window never reopens.
>>
>> Steps to reproduce:
>> 1. Set up two instances with nmap installed: one will act as the server
>>     the other as the client
>> 2. Execute on the server:
>>     a. lower rmem : `sysctl -w net.ipv4.tcp_rmem="16 32 32"`
>>     b. start a listener: `nc -l -p 1234`
>> 3. Execute on the client:
>>     a. lower tcp_retries2: `sysctl -w net.ipv4.tcp_retries2=8`
>>     b. send pakcets: `cat /dev/zero | nc <server-ip> 1234`
>>     c. after five seconds, stop the process: `killall nc`
>> 4. Execute on the server: `killall -STOP nc`
>> 5. Expected abnormal result: using `ss` command, we'll notice that the
>>     client connection remains stuck in the FIN_WAIT1 state, and the
>>     backoff counter always be 8 and no longer increased, as shown below:
>>     ```

Thanks for your suggestions, Eric. We will prepare the packetdrill test
and resend a series of two patches.

> 
> Hi MingMing
> 
> Please prepare and share with us a packetdrill test, instead of this
> 'repro', which is the old way of describing things :/
> 
> - This will be easier for us to understand the issue.
> 
> - It will be added to existing tests in tools/testing/selftests/net/packetdrill
> if your patch is accepted, so that we can make sure future changes are
> not breaking this again.
> 
> Ideally, you should attach this packetdrill test in a second patch
> (thus sending a series of two patches)
> 
> Thank you.
> 
>>     FIN-WAIT-1 0      1389    172.16.0.2:50316    172.16.0.1:1234
>>           cubic wscale:2,7 rto:201 backoff:8 rtt:0.078/0.007 mss:36
>>                   ... other fields omitted ...
>>     ```
>> 6. If we set tcp_retries2 to 15 and repeat the steps above, the FIN_WAIT1
>>     state will be forcefully reclaimed after about 5 minutes.
>>
>> During the zero-window probe retry process, it will check whether the
>> current connection is alive or not. If the connection is not alive and
>> the counter of retries exceeds the maximum allowed `max_probes`, retry
>> process will be terminated.
>>
>> In our case, when we set `net.ipv4.tcp_retries2` to 8 or a less value,
>> according to the current implementation, the `icsk->icsk_backoff` counter
>> will be capped at `net.ipv4.tcp_retries2`. The value calculated by
>> `inet_csk_rto_backoff` will always be too small, which means the
>> computed backoff duration will always be less than rto_max. As a result,
>> the alive check will always return true. The condition before the
>> `goto abort` statement is an logical AND condition, the abort branch
>> can never be reached.
>>
>> So, the TCP retransmission backoff mechanism has two issues:
>>
>> 1. `icsk->icsk_backoff` should monotonically increase during probe
>>     transmission and, upon reaching the maximum backoff limit, the
>>     connection should be terminated. However, the backoff value itself
>>     must not be capped prematurely — it should only control when to abort.
>>
>> 2. The condition for orphaned connection abort was incorrectly based on
>>     connection liveness and probe count. It should instead consider whether
>>     the number of orphaned probes exceeds the intended limit.
>>
>> To fix this, introduce a local variable `orphan_probes` to track orphan
>> probe attempts separately from `max_probes`, which is used for RTO
>> retransmissions. This decouples the two counters and prevents accidental
>> overwrites, ensuring correct timeout behavior for orphaned connections.
>>
>> Fixes: b248230c34970 ("tcp: abort orphan sockets stalling on zero window probes")
>> Co-developed-by: Dust Li <dust.li@linux.alibaba.com>
>> Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
>> Co-developed-by: MingMing Wang <mii.w@linux.alibaba.com>
>> Signed-off-by: MingMing Wang <mii.w@linux.alibaba.com>
>>
>> ---
>> We couldn't determine the rationale behind the following check in tcp_send_probe0():
>> ```
>> if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
>>      icsk->icsk_backoff++;
>> ```
>>
>> This condition appears to be the root cause of the observed stall.
>> However, it has existed in the kernel for over 20 years — which suggests
>> there might be a historical or subtle reason for its presence.
>>
>> We would greatly appreciate it if anyone could shed
>> ---
>>   net/ipv4/tcp_output.c | 4 +---
>>   net/ipv4/tcp_timer.c  | 4 ++--
>>   2 files changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index caf11920a878..21795d696e38 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -4385,7 +4385,6 @@ void tcp_send_probe0(struct sock *sk)
>>   {
>>          struct inet_connection_sock *icsk = inet_csk(sk);
>>          struct tcp_sock *tp = tcp_sk(sk);
>> -       struct net *net = sock_net(sk);
>>          unsigned long timeout;
>>          int err;
>>
>> @@ -4401,8 +4400,7 @@ void tcp_send_probe0(struct sock *sk)
>>
>>          icsk->icsk_probes_out++;
>>          if (err <= 0) {
>> -               if (icsk->icsk_backoff < READ_ONCE(net->ipv4.sysctl_tcp_retries2))
>> -                       icsk->icsk_backoff++;
>> +               icsk->icsk_backoff++;

We agree with your perspective. Futhermore, as mentioned in the raw
text, we would appreciate it if you could clarify whether this cap is
just to prevent overflow by using a huge value, or if it should be set
to a specific meaningful value.

> 
> I think we need to have a cap, otherwise we risk overflows in
> inet_csk_rto_backoff()
> 
> 
>>                  timeout = tcp_probe0_when(sk, tcp_rto_max(sk));
>>          } else {
>>                  /* If packet was not sent due to local congestion,
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index a207877270fb..4dba2928e1bf 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -419,9 +419,9 @@ static void tcp_probe_timer(struct sock *sk)
>>          if (sock_flag(sk, SOCK_DEAD)) {
>>                  unsigned int rto_max = tcp_rto_max(sk);
>>                  const bool alive = inet_csk_rto_backoff(icsk, rto_max) < rto_max;
>> +               int orphan_probes = tcp_orphan_retries(sk, alive);
>>
>> -               max_probes = tcp_orphan_retries(sk, alive);
>> -               if (!alive && icsk->icsk_backoff >= max_probes)
>> +               if (!alive || icsk->icsk_backoff >= orphan_probes)
>>                          goto abort;
>>                  if (tcp_out_of_resources(sk, true))
>>                          return;
>> --
>> 2.46.0
>>


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-08-25  6:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-22  6:02 [RFC net] tcp: Fix orphaned socket stalling indefinitely in FIN-WAIT-1 'MingMing Wang'
2025-08-22  8:53 ` Eric Dumazet
2025-08-22 16:02   ` Yuchung Cheng
2025-08-25  6:11   ` MingMing Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).