[PATCH net-next 0/3] net: tcp: support probing OOM

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next 0/3] net: tcp: support probing OOM
@ 2023-07-27 12:51 menglong8.dong
  2023-07-27 12:51 ` [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: menglong8.dong @ 2023-07-27 12:51 UTC (permalink / raw)
  To: edumazet
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

In this series, we make some small changes to make the tcp retransmission
become zero-window probes if the receiver drops the skb because of memory
pressure.

In the 1st patch, we reply a zero-window ACK if the skb is dropped
because out of memory, instead of dropping the skb silently.

In the 2nd patch, we allow a zero-window ACK to update the window.

In the 3rd patch, we check the timeout of a probing socket with
'(u32)icsk->icsk_timeout', instead of 'tcp_jiffies32'. This is more like
a bugfix.

After these changes, the tcp can probe the OOM of the receiver forever.

Menglong Dong (3):
  net: tcp: send zero-window ACK when no memory
  net: tcp: allow zero-window ACK update the window
  net: tcp: check timeout by icsk->icsk_timeout in
    tcp_retransmit_timer()

 include/net/inet_connection_sock.h |  3 ++-
 net/ipv4/tcp_input.c               |  6 +++---
 net/ipv4/tcp_output.c              | 14 +++++++++++---
 net/ipv4/tcp_timer.c               |  6 +++++-
 4 files changed, 21 insertions(+), 8 deletions(-)

-- 
2.40.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory
  2023-07-27 12:51 [PATCH net-next 0/3] net: tcp: support probing OOM menglong8.dong
@ 2023-07-27 12:51 ` menglong8.dong
  2023-07-27 19:17   ` Eric Dumazet
  2023-07-27 12:51 ` [PATCH net-next 2/3] net: tcp: allow zero-window ACK update the window menglong8.dong
  2023-07-27 12:51 ` [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer() menglong8.dong
  2 siblings, 1 reply; 17+ messages in thread
From: menglong8.dong @ 2023-07-27 12:51 UTC (permalink / raw)
  To: edumazet
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

For now, skb will be dropped when no memory, which makes client keep
retrans util timeout and it's not friendly to the users.

In this patch, we reply an ACK with zero-window in this case to update
the snd_wnd of the sender to 0. Therefore, the sender won't timeout the
connection and will probe the zero-window with the retransmits.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
 include/net/inet_connection_sock.h |  3 ++-
 net/ipv4/tcp_input.c               |  4 ++--
 net/ipv4/tcp_output.c              | 14 +++++++++++---
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c2b15f7e5516..be3c858a2ebb 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -164,7 +164,8 @@ enum inet_csk_ack_state_t {
 	ICSK_ACK_TIMER  = 2,
 	ICSK_ACK_PUSHED = 4,
 	ICSK_ACK_PUSHED2 = 8,
-	ICSK_ACK_NOW = 16	/* Send the next ACK immediately (once) */
+	ICSK_ACK_NOW = 16,	/* Send the next ACK immediately (once) */
+	ICSK_ACK_NOMEM = 32,
 };
 
 void inet_csk_init_xmit_timers(struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 3cd92035e090..03111af6115d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5061,7 +5061,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 			reason = SKB_DROP_REASON_PROTO_MEM;
 			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
 			sk->sk_data_ready(sk);
-			goto drop;
+			inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOMEM;
+			goto out_of_window;
 		}
 
 		eaten = tcp_queue_rcv(sk, skb, &fragstolen);
@@ -5102,7 +5103,6 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 out_of_window:
 		tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 		inet_csk_schedule_ack(sk);
-drop:
 		tcp_drop_reason(sk, skb, reason);
 		return;
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2cb39b6dad02..81aa2c615924 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -257,11 +257,19 @@ EXPORT_SYMBOL(tcp_select_initial_window);
 static u16 tcp_select_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 old_win = tp->rcv_wnd;
-	u32 cur_win = tcp_receive_window(tp);
-	u32 new_win = __tcp_select_window(sk);
 	struct net *net = sock_net(sk);
+	u32 old_win = tp->rcv_wnd;
+	u32 cur_win, new_win;
+
+	/* Make the window 0 if we failed to queue the data because we
+	 * are out of memory. The window is temporary, so we don't store
+	 * it on the socket.
+	 */
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+		return 0;
 
+	cur_win = tcp_receive_window(tp);
+	new_win = __tcp_select_window(sk);
 	if (new_win < cur_win) {
 		/* Danger Will Robinson!
 		 * Don't update rcv_wup/rcv_wnd here or else
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net-next 2/3] net: tcp: allow zero-window ACK update the window
  2023-07-27 12:51 [PATCH net-next 0/3] net: tcp: support probing OOM menglong8.dong
  2023-07-27 12:51 ` [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
@ 2023-07-27 12:51 ` menglong8.dong
  2023-07-27 12:51 ` [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer() menglong8.dong
  2 siblings, 0 replies; 17+ messages in thread
From: menglong8.dong @ 2023-07-27 12:51 UTC (permalink / raw)
  To: edumazet
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

Fow now, an ACK can update the window in following case, according to
the tcp_may_update_window():

1. the ACK acknowledged new data
2. the ACK has new data
3. the ACK expand the window and the seq of it is valid

Now, we allow the ACK update the window if the window is 0, and the
seq/ack of it is valid. This is for the case that the receiver replies
an zero-window ACK when it is under memory stress and can't queue the new
data.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 03111af6115d..d13b89370f76 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3526,7 +3526,7 @@ static inline bool tcp_may_update_window(const struct tcp_sock *tp,
 {
 	return	after(ack, tp->snd_una) ||
 		after(ack_seq, tp->snd_wl1) ||
-		(ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd);
+		(ack_seq == tp->snd_wl1 && (nwin > tp->snd_wnd || !nwin));
 }
 
 /* If we update tp->snd_una, also update tp->bytes_acked */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-27 12:51 [PATCH net-next 0/3] net: tcp: support probing OOM menglong8.dong
  2023-07-27 12:51 ` [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
  2023-07-27 12:51 ` [PATCH net-next 2/3] net: tcp: allow zero-window ACK update the window menglong8.dong
@ 2023-07-27 12:51 ` menglong8.dong
  2023-07-27 19:31   ` Eric Dumazet
  2 siblings, 1 reply; 17+ messages in thread
From: menglong8.dong @ 2023-07-27 12:51 UTC (permalink / raw)
  To: edumazet
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

In tcp_retransmit_timer(), a window shrunk connection will be regarded
as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
right all the time.

The retransmits will become zero-window probes in tcp_retransmit_timer()
if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
TCP_RTO_MAX sooner or later.

However, the timer is not precise enough, as it base on timer wheel.
Sorry that I am not good at timer, but I know the concept of time-wheel.
The longer of the timer, the rougher it will be. So the timeout is not
triggered after TCP_RTO_MAX, but 122877ms as I tested.

Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO come up to TCP_RTO_MAX.

Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exact the timestamp of the timeout.

Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
 net/ipv4/tcp_timer.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 470f581eedd4..3a20db15a186 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
 					    tp->snd_una, tp->snd_nxt);
 		}
 #endif
-		if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
+		/* It's a little rough here, we regard any valid packet that
+		 * update tp->rcv_tstamp as the reply of the retransmitted
+		 * packet.
+		 */
+		if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
 			tcp_write_err(sk);
 			goto out;
 		}
-- 
2.40.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory
  2023-07-27 12:51 ` [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
@ 2023-07-27 19:17   ` Eric Dumazet
  2023-07-28  2:37     ` Menglong Dong
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2023-07-27 19:17 UTC (permalink / raw)
  To: menglong8.dong
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong

On Thu, Jul 27, 2023 at 2:51 PM <menglong8.dong@gmail.com> wrote:
>
> From: Menglong Dong <imagedong@tencent.com>
>
> For now, skb will be dropped when no memory, which makes client keep
> retrans util timeout and it's not friendly to the users.
>
> In this patch, we reply an ACK with zero-window in this case to update
> the snd_wnd of the sender to 0. Therefore, the sender won't timeout the
> connection and will probe the zero-window with the retransmits.
>
> Signed-off-by: Menglong Dong <imagedong@tencent.com>
> ---
>  include/net/inet_connection_sock.h |  3 ++-
>  net/ipv4/tcp_input.c               |  4 ++--
>  net/ipv4/tcp_output.c              | 14 +++++++++++---
>  3 files changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index c2b15f7e5516..be3c858a2ebb 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -164,7 +164,8 @@ enum inet_csk_ack_state_t {
>         ICSK_ACK_TIMER  = 2,
>         ICSK_ACK_PUSHED = 4,
>         ICSK_ACK_PUSHED2 = 8,
> -       ICSK_ACK_NOW = 16       /* Send the next ACK immediately (once) */
> +       ICSK_ACK_NOW = 16,      /* Send the next ACK immediately (once) */
> +       ICSK_ACK_NOMEM = 32,
>  };
>
>  void inet_csk_init_xmit_timers(struct sock *sk,
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 3cd92035e090..03111af6115d 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5061,7 +5061,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
>                         reason = SKB_DROP_REASON_PROTO_MEM;
>                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
>                         sk->sk_data_ready(sk);
> -                       goto drop;
> +                       inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOMEM;

Also set ICSK_ACK_NOW ?

We also need to make sure to send an immediate ACK WIN 0 in the case we queued
the skb in an empty receive queue and we were under pressure.

We do not want to have a delayed ACK sending a normal RWIN,
then wait for another packet that we will probably drop.

Look at the code :

if (skb_queue_len(&sk->sk_receive_queue) == 0)
     sk_forced_mem_schedule(sk, skb->truesize);
else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {

and refactor it to make sure to set  ICSK_ACK_NOMEM even on the first packet
that bypassed the rmem_schedule().



> +                       goto out_of_window;

Why forcing quickack mode ? Please leave the "goto drop;"

>                 }
>
>                 eaten = tcp_queue_rcv(sk, skb, &fragstolen);
> @@ -5102,7 +5103,6 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
>  out_of_window:
>                 tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
>                 inet_csk_schedule_ack(sk);
> -drop:
>                 tcp_drop_reason(sk, skb, reason);
>                 return;
>         }
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-27 12:51 ` [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer() menglong8.dong
@ 2023-07-27 19:31   ` Eric Dumazet
  2023-07-28  2:57     ` Menglong Dong
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2023-07-27 19:31 UTC (permalink / raw)
  To: menglong8.dong
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong,
	Neal Cardwell

On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
>
> From: Menglong Dong <imagedong@tencent.com>
>
> In tcp_retransmit_timer(), a window shrunk connection will be regarded
> as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> right all the time.
>
> The retransmits will become zero-window probes in tcp_retransmit_timer()
> if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> TCP_RTO_MAX sooner or later.
>
> However, the timer is not precise enough, as it base on timer wheel.
> Sorry that I am not good at timer, but I know the concept of time-wheel.
> The longer of the timer, the rougher it will be. So the timeout is not
> triggered after TCP_RTO_MAX, but 122877ms as I tested.
>
> Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> once the RTO come up to TCP_RTO_MAX.
>
> Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> which is exact the timestamp of the timeout.
>
> Signed-off-by: Menglong Dong <imagedong@tencent.com>
> ---
>  net/ipv4/tcp_timer.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 470f581eedd4..3a20db15a186 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
>                                             tp->snd_una, tp->snd_nxt);
>                 }
>  #endif
> -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> +               /* It's a little rough here, we regard any valid packet that
> +                * update tp->rcv_tstamp as the reply of the retransmitted
> +                * packet.
> +                */
> +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
>                         tcp_write_err(sk);
>                         goto out;
>                 }


Hmm, this looks like a net candidate, since this is unrelated to the
other patches ?

Neal, what do you think ?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory
  2023-07-27 19:17   ` Eric Dumazet
@ 2023-07-28  2:37     ` Menglong Dong
  0 siblings, 0 replies; 17+ messages in thread
From: Menglong Dong @ 2023-07-28  2:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong

On Fri, Jul 28, 2023 at 3:17 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jul 27, 2023 at 2:51 PM <menglong8.dong@gmail.com> wrote:
> >
> > From: Menglong Dong <imagedong@tencent.com>
> >
> > For now, skb will be dropped when no memory, which makes client keep
> > retrans util timeout and it's not friendly to the users.
> >
> > In this patch, we reply an ACK with zero-window in this case to update
> > the snd_wnd of the sender to 0. Therefore, the sender won't timeout the
> > connection and will probe the zero-window with the retransmits.
> >
> > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > ---
> >  include/net/inet_connection_sock.h |  3 ++-
> >  net/ipv4/tcp_input.c               |  4 ++--
> >  net/ipv4/tcp_output.c              | 14 +++++++++++---
> >  3 files changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index c2b15f7e5516..be3c858a2ebb 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -164,7 +164,8 @@ enum inet_csk_ack_state_t {
> >         ICSK_ACK_TIMER  = 2,
> >         ICSK_ACK_PUSHED = 4,
> >         ICSK_ACK_PUSHED2 = 8,
> > -       ICSK_ACK_NOW = 16       /* Send the next ACK immediately (once) */
> > +       ICSK_ACK_NOW = 16,      /* Send the next ACK immediately (once) */
> > +       ICSK_ACK_NOMEM = 32,
> >  };
> >
> >  void inet_csk_init_xmit_timers(struct sock *sk,
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 3cd92035e090..03111af6115d 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5061,7 +5061,8 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
> >                         reason = SKB_DROP_REASON_PROTO_MEM;
> >                         NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
> >                         sk->sk_data_ready(sk);
> > -                       goto drop;
> > +                       inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOMEM;
>
> Also set ICSK_ACK_NOW ?
>
> We also need to make sure to send an immediate ACK WIN 0 in the case we queued
> the skb in an empty receive queue and we were under pressure.
>
> We do not want to have a delayed ACK sending a normal RWIN,
> then wait for another packet that we will probably drop.
>
> Look at the code :
>
> if (skb_queue_len(&sk->sk_receive_queue) == 0)
>      sk_forced_mem_schedule(sk, skb->truesize);
> else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
>
> and refactor it to make sure to set  ICSK_ACK_NOMEM even on the first packet
> that bypassed the rmem_schedule().
>

Ok, that makes sense.

>
>
> > +                       goto out_of_window;
>
> Why forcing quickack mode ? Please leave the "goto drop;"
>

Hmm...because I want to make it send an immediate
ACK. Obviously, a ICSK_ACK_NOW flag should be
the better choice here.

> >                 }
> >
> >                 eaten = tcp_queue_rcv(sk, skb, &fragstolen);
> > @@ -5102,7 +5103,6 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
> >  out_of_window:
> >                 tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
> >                 inet_csk_schedule_ack(sk);
> > -drop:
> >                 tcp_drop_reason(sk, skb, reason);
> >                 return;
> >         }
> >

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-27 19:31   ` Eric Dumazet
@ 2023-07-28  2:57     ` Menglong Dong
  2023-07-28  4:44       ` Neal Cardwell
  0 siblings, 1 reply; 17+ messages in thread
From: Menglong Dong @ 2023-07-28  2:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong,
	Neal Cardwell

On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> >
> > From: Menglong Dong <imagedong@tencent.com>
> >
> > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > right all the time.
> >
> > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > TCP_RTO_MAX sooner or later.
> >
> > However, the timer is not precise enough, as it base on timer wheel.
> > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > The longer of the timer, the rougher it will be. So the timeout is not
> > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> >
> > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > once the RTO come up to TCP_RTO_MAX.
> >
> > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > which is exact the timestamp of the timeout.
> >
> > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > ---
> >  net/ipv4/tcp_timer.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index 470f581eedd4..3a20db15a186 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> >                                             tp->snd_una, tp->snd_nxt);
> >                 }
> >  #endif
> > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > +               /* It's a little rough here, we regard any valid packet that
> > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > +                * packet.
> > +                */
> > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> >                         tcp_write_err(sk);
> >                         goto out;
> >                 }
>
>
> Hmm, this looks like a net candidate, since this is unrelated to the
> other patches ?

Yeah, this patch can be standalone. However, considering the
purpose of this series, it is necessary. Without this patch, the
OOM probe will always timeout after a few minutes.

I'm not sure if I express the problem clearly in the commit log.
Let's explain it more.

Let's mark the timestamp of the 10th timeout of the rtx timer
as TS1. Now, the retransmission happens and the ACK of
the retransmitted packet will update the tp->rcv_tstamp to
TS1+rtt.

The RTO now is TCP_RTO_MAX. So let's see what will
happen in the 11th timeout. As we timeout after 122877ms,
so tcp_jiffies32 now is "TS1+122877ms", and
"tcp_jiffies32 - tp->rcv_tstamp" is
"TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
which is always bigger than TCP_RTO_MAX, which is 120000ms.

>
> Neal, what do you think ?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28  2:57     ` Menglong Dong
@ 2023-07-28  4:44       ` Neal Cardwell
  2023-07-28  6:25         ` Menglong Dong
  0 siblings, 1 reply; 17+ messages in thread
From: Neal Cardwell @ 2023-07-28  4:44 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Eric Dumazet, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > >
> > > From: Menglong Dong <imagedong@tencent.com>
> > >
> > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > right all the time.
> > >
> > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > TCP_RTO_MAX sooner or later.
> > >
> > > However, the timer is not precise enough, as it base on timer wheel.
> > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > The longer of the timer, the rougher it will be. So the timeout is not
> > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > >
> > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > once the RTO come up to TCP_RTO_MAX.
> > >
> > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > which is exact the timestamp of the timeout.
> > >
> > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > ---
> > >  net/ipv4/tcp_timer.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > index 470f581eedd4..3a20db15a186 100644
> > > --- a/net/ipv4/tcp_timer.c
> > > +++ b/net/ipv4/tcp_timer.c
> > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > >                                             tp->snd_una, tp->snd_nxt);
> > >                 }
> > >  #endif
> > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > +               /* It's a little rough here, we regard any valid packet that
> > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > +                * packet.
> > > +                */
> > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > >                         tcp_write_err(sk);
> > >                         goto out;
> > >                 }
> >
> >
> > Hmm, this looks like a net candidate, since this is unrelated to the
> > other patches ?
>
> Yeah, this patch can be standalone. However, considering the
> purpose of this series, it is necessary. Without this patch, the
> OOM probe will always timeout after a few minutes.
>
> I'm not sure if I express the problem clearly in the commit log.
> Let's explain it more.
>
> Let's mark the timestamp of the 10th timeout of the rtx timer
> as TS1. Now, the retransmission happens and the ACK of
> the retransmitted packet will update the tp->rcv_tstamp to
> TS1+rtt.
>
> The RTO now is TCP_RTO_MAX. So let's see what will
> happen in the 11th timeout. As we timeout after 122877ms,
> so tcp_jiffies32 now is "TS1+122877ms", and
> "tcp_jiffies32 - tp->rcv_tstamp" is
> "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> which is always bigger than TCP_RTO_MAX, which is 120000ms.
>
> >
> > Neal, what do you think ?

Sorry, I am probably missing something here, but: what would ever make
this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> TCP_RTO_MAX) true? :-)

In your nicely explained scenario, your new expression,
icsk->icsk_timeout - tp->rcv_tstamp, will be:

  icsk->icsk_timeout - tp->rcv_tstamp
= TS1 + 120 sec      - (TS1+rtt)
= 120 sec - RTT

AFAICT there is no way for that expression to be bigger than
TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)

So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
TCP_RTO_MAX) will always be false, so rather than this patch we may as
well remove the if check and the body of the if block?

To me such a change does not seem like a safe and clear bug fix for
the "net" branch but rather a riskier design change (appropriate for
"net-next" branch) that has connections retry forever when the
receiver retracts the window to zero, under the estimation that this
is preferable to having the connections die in such a case.

There might be apps that depend on the old behavior of having
connections die in such cases, so we might want to have this new
fail-faster behavior guarded by a sysctl in case some sites need to
revert to the older behavior? Not sure...

neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28  4:44       ` Neal Cardwell
@ 2023-07-28  6:25         ` Menglong Dong
  2023-07-28  8:50           ` Eric Dumazet
  0 siblings, 1 reply; 17+ messages in thread
From: Menglong Dong @ 2023-07-28  6:25 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > > >
> > > > From: Menglong Dong <imagedong@tencent.com>
> > > >
> > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > right all the time.
> > > >
> > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > TCP_RTO_MAX sooner or later.
> > > >
> > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > >
> > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > once the RTO come up to TCP_RTO_MAX.
> > > >
> > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > which is exact the timestamp of the timeout.
> > > >
> > > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > > ---
> > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > index 470f581eedd4..3a20db15a186 100644
> > > > --- a/net/ipv4/tcp_timer.c
> > > > +++ b/net/ipv4/tcp_timer.c
> > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > >                                             tp->snd_una, tp->snd_nxt);
> > > >                 }
> > > >  #endif
> > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > +               /* It's a little rough here, we regard any valid packet that
> > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > +                * packet.
> > > > +                */
> > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > >                         tcp_write_err(sk);
> > > >                         goto out;
> > > >                 }
> > >
> > >
> > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > other patches ?
> >
> > Yeah, this patch can be standalone. However, considering the
> > purpose of this series, it is necessary. Without this patch, the
> > OOM probe will always timeout after a few minutes.
> >
> > I'm not sure if I express the problem clearly in the commit log.
> > Let's explain it more.
> >
> > Let's mark the timestamp of the 10th timeout of the rtx timer
> > as TS1. Now, the retransmission happens and the ACK of
> > the retransmitted packet will update the tp->rcv_tstamp to
> > TS1+rtt.
> >
> > The RTO now is TCP_RTO_MAX. So let's see what will
> > happen in the 11th timeout. As we timeout after 122877ms,
> > so tcp_jiffies32 now is "TS1+122877ms", and
> > "tcp_jiffies32 - tp->rcv_tstamp" is
> > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> >
> > >
> > > Neal, what do you think ?
>
> Sorry, I am probably missing something here, but: what would ever make
> this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > TCP_RTO_MAX) true? :-)
>

If the snd_wnd is 0, we need to keep probing until the window
is available. Meanwhile, any retransmission that don't have
a corresponding ACK (see what we do in the 1st patch), which
can be caused by the lost of the ACK or the lost of the retransmitted
packet, can make the condition true, as the tp->rcv_tstamp can't be
updated in time.

This is a little strict here. In the tcp_probe_timer(), we are allowed to
retransmit the probe0 packet for sysctl_tcp_retries2 times. But
we don't allow packets to be lost here.

> In your nicely explained scenario, your new expression,
> icsk->icsk_timeout - tp->rcv_tstamp, will be:
>
>   icsk->icsk_timeout - tp->rcv_tstamp
> = TS1 + 120 sec      - (TS1+rtt)
> = 120 sec - RTT
>
> AFAICT there is no way for that expression to be bigger than
> TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
>
> So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> TCP_RTO_MAX) will always be false, so rather than this patch we may as
> well remove the if check and the body of the if block?
>

Hmm......as I explained above, the condition will be true
if the real packet loss happens. And I think it is the origin
design.

> To me such a change does not seem like a safe and clear bug fix for
> the "net" branch but rather a riskier design change (appropriate for
> "net-next" branch) that has connections retry forever when the
> receiver retracts the window to zero, under the estimation that this
> is preferable to having the connections die in such a case.
>
> There might be apps that depend on the old behavior of having
> connections die in such cases, so we might want to have this new
> fail-faster behavior guarded by a sysctl in case some sites need to
> revert to the older behavior? Not sure...

Yeah, the behavior here will be different for the users. I'm not
sure if there are any users that rely on such behavior.

What do you think, Eric? Do we need a sysctl here?

Thanks!
Menglong Dong

>
> neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28  6:25         ` Menglong Dong
@ 2023-07-28  8:50           ` Eric Dumazet
  2023-07-28 14:25             ` Neal Cardwell
  0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2023-07-28  8:50 UTC (permalink / raw)
  To: Menglong Dong
  Cc: Neal Cardwell, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
>
> On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > > > >
> > > > > From: Menglong Dong <imagedong@tencent.com>
> > > > >
> > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > right all the time.
> > > > >
> > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > TCP_RTO_MAX sooner or later.
> > > > >
> > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > >
> > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > once the RTO come up to TCP_RTO_MAX.
> > > > >
> > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > which is exact the timestamp of the timeout.
> > > > >
> > > > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > > > ---
> > > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > --- a/net/ipv4/tcp_timer.c
> > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > >                                             tp->snd_una, tp->snd_nxt);
> > > > >                 }
> > > > >  #endif
> > > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > +               /* It's a little rough here, we regard any valid packet that
> > > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > +                * packet.
> > > > > +                */
> > > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > >                         tcp_write_err(sk);
> > > > >                         goto out;
> > > > >                 }
> > > >
> > > >
> > > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > > other patches ?
> > >
> > > Yeah, this patch can be standalone. However, considering the
> > > purpose of this series, it is necessary. Without this patch, the
> > > OOM probe will always timeout after a few minutes.
> > >
> > > I'm not sure if I express the problem clearly in the commit log.
> > > Let's explain it more.
> > >
> > > Let's mark the timestamp of the 10th timeout of the rtx timer
> > > as TS1. Now, the retransmission happens and the ACK of
> > > the retransmitted packet will update the tp->rcv_tstamp to
> > > TS1+rtt.
> > >
> > > The RTO now is TCP_RTO_MAX. So let's see what will
> > > happen in the 11th timeout. As we timeout after 122877ms,
> > > so tcp_jiffies32 now is "TS1+122877ms", and
> > > "tcp_jiffies32 - tp->rcv_tstamp" is
> > > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> > >
> > > >
> > > > Neal, what do you think ?
> >
> > Sorry, I am probably missing something here, but: what would ever make
> > this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > > TCP_RTO_MAX) true? :-)
> >
>
> If the snd_wnd is 0, we need to keep probing until the window
> is available. Meanwhile, any retransmission that don't have
> a corresponding ACK (see what we do in the 1st patch), which
> can be caused by the lost of the ACK or the lost of the retransmitted
> packet, can make the condition true, as the tp->rcv_tstamp can't be
> updated in time.
>
> This is a little strict here. In the tcp_probe_timer(), we are allowed to
> retransmit the probe0 packet for sysctl_tcp_retries2 times. But
> we don't allow packets to be lost here.
>
> > In your nicely explained scenario, your new expression,
> > icsk->icsk_timeout - tp->rcv_tstamp, will be:
> >
> >   icsk->icsk_timeout - tp->rcv_tstamp
> > = TS1 + 120 sec      - (TS1+rtt)
> > = 120 sec - RTT
> >
> > AFAICT there is no way for that expression to be bigger than
> > TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
> >
> > So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> > TCP_RTO_MAX) will always be false, so rather than this patch we may as
> > well remove the if check and the body of the if block?
> >
>
> Hmm......as I explained above, the condition will be true
> if the real packet loss happens. And I think it is the origin
> design.
>
> > To me such a change does not seem like a safe and clear bug fix for
> > the "net" branch but rather a riskier design change (appropriate for
> > "net-next" branch) that has connections retry forever when the
> > receiver retracts the window to zero, under the estimation that this
> > is preferable to having the connections die in such a case.
> >
> > There might be apps that depend on the old behavior of having
> > connections die in such cases, so we might want to have this new
> > fail-faster behavior guarded by a sysctl in case some sites need to
> > revert to the older behavior? Not sure...
>
> Yeah, the behavior here will be different for the users. I'm not
> sure if there are any users that rely on such behavior.
>
> What do you think, Eric? Do we need a sysctl here?
>

I honestly do not know what problem you want to solve.

As Neal pointed out, the new condition would not trigger,
so one can question about the whole piece of code,
what is its purpose exactly ?

When receiving WIN 0 acks, we should enter the so called probe0 state.
Maybe the real issue is that the 'receiver' will falsely send WIN X>0 ACKS,
because win probe acks do not really ensure memory is available to
receive new packets ?

Maybe provide a packetdrill test to show what kind of issue you are facing...

In Google kernels, we have TCP_MAX_RTO reduced to 30 seconds, and the
following test runs fine.

// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

// Establish a connection.
  +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 457
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 20000) = 20000
   +0 > P. 1:10001(10000) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

// Send zwp since we received rwin of 0 and have data to send.
  +.3 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

  +.6 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

 +1.2 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

 +2.4 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win  0

 +4.8 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win  0

 +9.6 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win  0

+19.2 > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

+30   > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 0

+30   > . 10000:10000(0) ack 1
  +.1 < . 1:1(0) ack 10001 win 193

// Received non-zero window update. Send rest of the data.
   +0 > P. 10001:20001(10000) ack 1
  +.1 < . 1:1(0) ack 20001 win 457

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28  8:50           ` Eric Dumazet
@ 2023-07-28 14:25             ` Neal Cardwell
  2023-07-28 17:15               ` Neal Cardwell
                                 ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Neal Cardwell @ 2023-07-28 14:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Menglong Dong, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> >
> > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > >
> > > > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > > > > >
> > > > > > From: Menglong Dong <imagedong@tencent.com>
> > > > > >
> > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > right all the time.
> > > > > >
> > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > TCP_RTO_MAX sooner or later.
> > > > > >
> > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > >
> > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > >
> > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > which is exact the timestamp of the timeout.
> > > > > >
> > > > > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > > > > ---
> > > > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > >                                             tp->snd_una, tp->snd_nxt);
> > > > > >                 }
> > > > > >  #endif
> > > > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > +               /* It's a little rough here, we regard any valid packet that
> > > > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > +                * packet.
> > > > > > +                */
> > > > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > >                         tcp_write_err(sk);
> > > > > >                         goto out;
> > > > > >                 }
> > > > >
> > > > >
> > > > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > > > other patches ?
> > > >
> > > > Yeah, this patch can be standalone. However, considering the
> > > > purpose of this series, it is necessary. Without this patch, the
> > > > OOM probe will always timeout after a few minutes.
> > > >
> > > > I'm not sure if I express the problem clearly in the commit log.
> > > > Let's explain it more.
> > > >
> > > > Let's mark the timestamp of the 10th timeout of the rtx timer
> > > > as TS1. Now, the retransmission happens and the ACK of
> > > > the retransmitted packet will update the tp->rcv_tstamp to
> > > > TS1+rtt.
> > > >
> > > > The RTO now is TCP_RTO_MAX. So let's see what will
> > > > happen in the 11th timeout. As we timeout after 122877ms,
> > > > so tcp_jiffies32 now is "TS1+122877ms", and
> > > > "tcp_jiffies32 - tp->rcv_tstamp" is
> > > > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > > > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> > > >
> > > > >
> > > > > Neal, what do you think ?
> > >
> > > Sorry, I am probably missing something here, but: what would ever make
> > > this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > > > TCP_RTO_MAX) true? :-)
> > >
> >
> > If the snd_wnd is 0, we need to keep probing until the window
> > is available. Meanwhile, any retransmission that don't have
> > a corresponding ACK (see what we do in the 1st patch), which
> > can be caused by the lost of the ACK or the lost of the retransmitted
> > packet, can make the condition true, as the tp->rcv_tstamp can't be
> > updated in time.
> >
> > This is a little strict here. In the tcp_probe_timer(), we are allowed to
> > retransmit the probe0 packet for sysctl_tcp_retries2 times. But
> > we don't allow packets to be lost here.
> >
> > > In your nicely explained scenario, your new expression,
> > > icsk->icsk_timeout - tp->rcv_tstamp, will be:
> > >
> > >   icsk->icsk_timeout - tp->rcv_tstamp
> > > = TS1 + 120 sec      - (TS1+rtt)
> > > = 120 sec - RTT
> > >
> > > AFAICT there is no way for that expression to be bigger than
> > > TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
> > >
> > > So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> > > TCP_RTO_MAX) will always be false, so rather than this patch we may as
> > > well remove the if check and the body of the if block?
> > >
> >
> > Hmm......as I explained above, the condition will be true
> > if the real packet loss happens. And I think it is the origin
> > design.
> >
> > > To me such a change does not seem like a safe and clear bug fix for
> > > the "net" branch but rather a riskier design change (appropriate for
> > > "net-next" branch) that has connections retry forever when the
> > > receiver retracts the window to zero, under the estimation that this
> > > is preferable to having the connections die in such a case.
> > >
> > > There might be apps that depend on the old behavior of having
> > > connections die in such cases, so we might want to have this new
> > > fail-faster behavior guarded by a sysctl in case some sites need to
> > > revert to the older behavior? Not sure...
> >
> > Yeah, the behavior here will be different for the users. I'm not
> > sure if there are any users that rely on such behavior.
> >
> > What do you think, Eric? Do we need a sysctl here?
> >
>
> I honestly do not know what problem you want to solve.
>
> As Neal pointed out, the new condition would not trigger,
> so one can question about the whole piece of code,
> what is its purpose exactly ?
>
> When receiving WIN 0 acks, we should enter the so called probe0 state.
> Maybe the real issue is that the 'receiver' will falsely send WIN X>0 ACKS,
> because win probe acks do not really ensure memory is available to
> receive new packets ?
>
> Maybe provide a packetdrill test to show what kind of issue you are facing...
>
> In Google kernels, we have TCP_MAX_RTO reduced to 30 seconds, and the
> following test runs fine.
>
> // Test how sender reacts to unexpected arrival rwin of 0.
>
> `../common/defaults.sh`
>
> // Create a socket.
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
> // Establish a connection.
>   +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
>    +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>   +.1 < . 1:1(0) ack 1 win 457
>    +0 accept(3, ..., ...) = 4
>
>    +0 write(4, ..., 20000) = 20000
>    +0 > P. 1:10001(10000) ack 1
>   +.1 < . 1:1(0) ack 10001 win 0
>
> // Send zwp since we received rwin of 0 and have data to send.
>   +.3 > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win 0
>
>   +.6 > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win 0
>
>  +1.2 > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win 0
>
>  +2.4 > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win  0
>
>  +4.8 > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win  0
>
>  +9.6 > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win  0
>
> +19.2 > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win 0
>
> +30   > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win 0
>
> +30   > . 10000:10000(0) ack 1
>   +.1 < . 1:1(0) ack 10001 win 193
>
> // Received non-zero window update. Send rest of the data.
>    +0 > P. 10001:20001(10000) ack 1
>   +.1 < . 1:1(0) ack 20001 win 457

In that packetdrill case AFAICT that is the ZWP timer firing, and the
sender sends a ZWP.

I think maybe Menglong is looking more at something like the following
scenario, where at the time the RTO timer fires the data sender finds
the tp->snd_wnd is zero, so it sends a retransmit of the
lowest-sequence data packet.

Here is a packetdrill case and the tcpdump trace on an upstream
net-next kernel... I have not worked out all the details at the end,
but perhaps it can help move the discussion forward:


~/packetdrill/gtests/net/tcp/receiver_window# cat rwin-rto-zero-window.pkt
// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

// Establish a connection.
  +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
  +.1 < . 1:1(0) ack 1 win 457
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 20000) = 20000
   +0 > P. 1:10001(10000) ack 1

// TLP
  +.2 > . 10001:11001(1000) ack 1
// Receiver has retracted rwin to 0
// (perhaps from the 2023 proposed OOM code?).
  +.1 < . 1:1(0) ack 1 win 0

// RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
// so we take the special f (!tp->snd_wnd...) code path.
  +.2 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

  +.5 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

 +1.2 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

 +2.4 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

 +4.8 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

 +9.6 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

+19.2 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

+38.4 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

+76.8 > . 1:1001(1000) ack 1
  +.1 < . 1:1(0) ack 1 win 0

+120 > . 1:1001(1000) ack 1
 +.1 < . 1:1(0) ack 1 win 0

+120 > . 1:1001(1000) ack 1
 +.1 < . 1:1(0) ack 1001 win 1000

// Received non-zero window update. Send more data.
  +0 > P. 1001:3001(2000) ack 1
 +.1 < . 1:1(0) ack 3001 win 1000

----------
When I run that script on a net-next kernel I see the rounding up of
the RTO to 122 secs rather than 120 secs, but for whatever reason the
script does not cause the socket to die early...

The tcpdump trace:

 tcpdump -ttt -n -i any port 8080 &

->

~/packetdrill/gtests/net/tcp/receiver_window#
../../packetdrill/packetdrill rwin-rto-zero-window.pkt
 00:01:01.370344 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [S], seq 0, win 65535, options [mss
1000,nop,nop,sackOK,nop,wscale 6], length 0
 00:00:00.000096 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [S.], seq 3847169154, ack 1, win 65535, options [mss
1460,nop,nop,sackOK,nop,wscale 14], length 0
 00:00:00.100277 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 457, length 0
 00:00:00.000090 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 1:2001, ack 1, win 4, length 2000: HTTP
 00:00:00.000006 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 2001:4001, ack 1, win 4, length 2000: HTTP
 00:00:00.000003 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 4001:6001, ack 1, win 4, length 2000: HTTP
 00:00:00.000002 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 6001:8001, ack 1, win 4, length 2000: HTTP
 00:00:00.000001 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [P.], seq 8001:10001, ack 1, win 4, length 2000: HTTP
 00:00:00.209131 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 10001:11001, ack 1, win 4, length 1000: HTTP
 00:00:00.100190 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:00.203824 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100175 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:00.507835 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100192 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:01.115858 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100182 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:02.331747 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100198 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:04.955980 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100197 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:09.627985 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100179 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:19.355725 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100203 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:00:42.395633 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100202 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:01:17.724059 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100201 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:02:02.779516 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100229 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1, win 0, length 0
 00:02:02.779828 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
 00:00:00.100230 ?     In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
Flags [.], ack 1001, win 1000, length 0
 00:00:00.000034 ?     Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 11001:12001, ack 1, win 4, length 1000: HTTP
 00:00:00.000005 ?     Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
Flags [.], seq 12001:13001, ack 1, win 4, length 1000: HTTP

rwin-rto-zero-window.pkt:62: error handling packet: live packet field
tcp_psh: expected: 1 (0x1) vs actual: 0 (0x0)
script packet: 405.390244 P. 1001:3001(2000) ack 1
actual packet: 405.390237 . 11001:13001(2000) ack 1 win 4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28 14:25             ` Neal Cardwell
@ 2023-07-28 17:15               ` Neal Cardwell
  2023-07-31  3:41                 ` Menglong Dong
  2023-07-31  9:49                 ` Menglong Dong
  2023-07-31  3:28               ` Menglong Dong
  2023-07-31  8:24               ` Menglong Dong
  2 siblings, 2 replies; 17+ messages in thread
From: Neal Cardwell @ 2023-07-28 17:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Menglong Dong, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Fri, Jul 28, 2023 at 7:25 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Menglong Dong <imagedong@tencent.com>
> > > > > > >
> > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > right all the time.
> > > > > > >
> > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > >
> > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > >
> > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > >
> > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > which is exact the timestamp of the timeout.
> > > > > > >
> > > > > > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > > > > > ---
> > > > > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > >                                             tp->snd_una, tp->snd_nxt);
> > > > > > >                 }
> > > > > > >  #endif
> > > > > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > +               /* It's a little rough here, we regard any valid packet that
> > > > > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > +                * packet.
> > > > > > > +                */
> > > > > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > >                         tcp_write_err(sk);
> > > > > > >                         goto out;
> > > > > > >                 }

One potential pre-existing issue with this logic: if the connection is
restarting from idle, then tp->rcv_tstamp could already be a long time
(minutes or hours) in the past even on the first RTO, in which case
the very first RTO that found a zero tp->snd_wnd  would find this
check returns true, and would destroy the connection immediately. This
seems extremely brittle.

AFAICT it would be safer to replace this logic with a call to the
standard tcp_write_timeout() logic that has a more robust check to see
if the connection should be destroyed.

neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28 14:25             ` Neal Cardwell
  2023-07-28 17:15               ` Neal Cardwell
@ 2023-07-31  3:28               ` Menglong Dong
  2023-07-31  8:24               ` Menglong Dong
  2 siblings, 0 replies; 17+ messages in thread
From: Menglong Dong @ 2023-07-31  3:28 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Fri, Jul 28, 2023 at 10:25 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > >
> > > > > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Menglong Dong <imagedong@tencent.com>
> > > > > > >
> > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > right all the time.
> > > > > > >
> > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > >
> > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > >
> > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > >
> > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > which is exact the timestamp of the timeout.
> > > > > > >
> > > > > > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > > > > > ---
> > > > > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > >                                             tp->snd_una, tp->snd_nxt);
> > > > > > >                 }
> > > > > > >  #endif
> > > > > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > +               /* It's a little rough here, we regard any valid packet that
> > > > > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > +                * packet.
> > > > > > > +                */
> > > > > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > >                         tcp_write_err(sk);
> > > > > > >                         goto out;
> > > > > > >                 }
> > > > > >
> > > > > >
> > > > > > Hmm, this looks like a net candidate, since this is unrelated to the
> > > > > > other patches ?
> > > > >
> > > > > Yeah, this patch can be standalone. However, considering the
> > > > > purpose of this series, it is necessary. Without this patch, the
> > > > > OOM probe will always timeout after a few minutes.
> > > > >
> > > > > I'm not sure if I express the problem clearly in the commit log.
> > > > > Let's explain it more.
> > > > >
> > > > > Let's mark the timestamp of the 10th timeout of the rtx timer
> > > > > as TS1. Now, the retransmission happens and the ACK of
> > > > > the retransmitted packet will update the tp->rcv_tstamp to
> > > > > TS1+rtt.
> > > > >
> > > > > The RTO now is TCP_RTO_MAX. So let's see what will
> > > > > happen in the 11th timeout. As we timeout after 122877ms,
> > > > > so tcp_jiffies32 now is "TS1+122877ms", and
> > > > > "tcp_jiffies32 - tp->rcv_tstamp" is
> > > > > "TS1+122877ms - (TS1+rtt)" -> "122877ms - rtt",
> > > > > which is always bigger than TCP_RTO_MAX, which is 120000ms.
> > > > >
> > > > > >
> > > > > > Neal, what do you think ?
> > > >
> > > > Sorry, I am probably missing something here, but: what would ever make
> > > > this new proposed condition ((u32)icsk->icsk_timeout - tp->rcv_tstamp
> > > > > TCP_RTO_MAX) true? :-)
> > > >
> > >
> > > If the snd_wnd is 0, we need to keep probing until the window
> > > is available. Meanwhile, any retransmission that don't have
> > > a corresponding ACK (see what we do in the 1st patch), which
> > > can be caused by the lost of the ACK or the lost of the retransmitted
> > > packet, can make the condition true, as the tp->rcv_tstamp can't be
> > > updated in time.
> > >
> > > This is a little strict here. In the tcp_probe_timer(), we are allowed to
> > > retransmit the probe0 packet for sysctl_tcp_retries2 times. But
> > > we don't allow packets to be lost here.
> > >
> > > > In your nicely explained scenario, your new expression,
> > > > icsk->icsk_timeout - tp->rcv_tstamp, will be:
> > > >
> > > >   icsk->icsk_timeout - tp->rcv_tstamp
> > > > = TS1 + 120 sec      - (TS1+rtt)
> > > > = 120 sec - RTT
> > > >
> > > > AFAICT there is no way for that expression to be bigger than
> > > > TCP_RTO_MAX = 120 sec unless somehow RTT is negative. :-)
> > > >
> > > > So AFAICT your expression ((u32)icsk->icsk_timeout - tp->rcv_tstamp >
> > > > TCP_RTO_MAX) will always be false, so rather than this patch we may as
> > > > well remove the if check and the body of the if block?
> > > >
> > >
> > > Hmm......as I explained above, the condition will be true
> > > if the real packet loss happens. And I think it is the origin
> > > design.
> > >
> > > > To me such a change does not seem like a safe and clear bug fix for
> > > > the "net" branch but rather a riskier design change (appropriate for
> > > > "net-next" branch) that has connections retry forever when the
> > > > receiver retracts the window to zero, under the estimation that this
> > > > is preferable to having the connections die in such a case.
> > > >
> > > > There might be apps that depend on the old behavior of having
> > > > connections die in such cases, so we might want to have this new
> > > > fail-faster behavior guarded by a sysctl in case some sites need to
> > > > revert to the older behavior? Not sure...
> > >
> > > Yeah, the behavior here will be different for the users. I'm not
> > > sure if there are any users that rely on such behavior.
> > >
> > > What do you think, Eric? Do we need a sysctl here?
> > >
> >
> > I honestly do not know what problem you want to solve.
> >
> > As Neal pointed out, the new condition would not trigger,
> > so one can question about the whole piece of code,
> > what is its purpose exactly ?
> >
> > When receiving WIN 0 acks, we should enter the so called probe0 state.
> > Maybe the real issue is that the 'receiver' will falsely send WIN X>0 ACKS,
> > because win probe acks do not really ensure memory is available to
> > receive new packets ?
> >
> > Maybe provide a packetdrill test to show what kind of issue you are facing...
> >
> > In Google kernels, we have TCP_MAX_RTO reduced to 30 seconds, and the
> > following test runs fine.
> >
> > // Test how sender reacts to unexpected arrival rwin of 0.
> >
> > `../common/defaults.sh`
> >
> > // Create a socket.
> >     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> >    +0 bind(3, ..., ...) = 0
> >    +0 listen(3, 1) = 0
> >
> > // Establish a connection.
> >   +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
> >    +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >   +.1 < . 1:1(0) ack 1 win 457
> >    +0 accept(3, ..., ...) = 4
> >
> >    +0 write(4, ..., 20000) = 20000
> >    +0 > P. 1:10001(10000) ack 1
> >   +.1 < . 1:1(0) ack 10001 win 0
> >
> > // Send zwp since we received rwin of 0 and have data to send.
> >   +.3 > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win 0
> >
> >   +.6 > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win 0
> >
> >  +1.2 > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win 0
> >
> >  +2.4 > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win  0
> >
> >  +4.8 > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win  0
> >
> >  +9.6 > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win  0
> >
> > +19.2 > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win 0
> >
> > +30   > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win 0
> >
> > +30   > . 10000:10000(0) ack 1
> >   +.1 < . 1:1(0) ack 10001 win 193
> >
> > // Received non-zero window update. Send rest of the data.
> >    +0 > P. 10001:20001(10000) ack 1
> >   +.1 < . 1:1(0) ack 20001 win 457
>
> In that packetdrill case AFAICT that is the ZWP timer firing, and the
> sender sends a ZWP.
>
> I think maybe Menglong is looking more at something like the following
> scenario, where at the time the RTO timer fires the data sender finds
> the tp->snd_wnd is zero, so it sends a retransmit of the
> lowest-sequence data packet.
>

Sorry for the late reply, I'm traveling on the weekend for two days.

Yes, I think it is close to my scenario. The sender will send
a retransmit of the lowest-sequence data packet to probe
the window.

> Here is a packetdrill case and the tcpdump trace on an upstream
> net-next kernel... I have not worked out all the details at the end,
> but perhaps it can help move the discussion forward:
>

The packetdrill is more amazing than I thought, I never
thought it could do such a thing.

>
> ~/packetdrill/gtests/net/tcp/receiver_window# cat rwin-rto-zero-window.pkt
> // Test how sender reacts to unexpected arrival rwin of 0.
>
> `../common/defaults.sh`
>
> // Create a socket.
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
> // Establish a connection.
>   +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
>    +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
>   +.1 < . 1:1(0) ack 1 win 457
>    +0 accept(3, ..., ...) = 4
>
>    +0 write(4, ..., 20000) = 20000
>    +0 > P. 1:10001(10000) ack 1
>
> // TLP
>   +.2 > . 10001:11001(1000) ack 1
> // Receiver has retracted rwin to 0
> // (perhaps from the 2023 proposed OOM code?).
>   +.1 < . 1:1(0) ack 1 win 0
>
> // RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
> // so we take the special f (!tp->snd_wnd...) code path.
>   +.2 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>   +.5 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +1.2 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +2.4 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +4.8 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +9.6 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +19.2 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +38.4 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +76.8 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
>  +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
>  +.1 < . 1:1(0) ack 1001 win 1000
>
> // Received non-zero window update. Send more data.
>   +0 > P. 1001:3001(2000) ack 1
>  +.1 < . 1:1(0) ack 3001 win 1000
>
> ----------
> When I run that script on a net-next kernel I see the rounding up of
> the RTO to 122 secs rather than 120 secs, but for whatever reason the
> script does not cause the socket to die early...
>

Enn....it should die, let me dig deeper into the script first.
The script looks nice, I need a little time to learn it.

Thanks!
Menglong Dong

> The tcpdump trace:
>
>  tcpdump -ttt -n -i any port 8080 &
>
> ->
>
> ~/packetdrill/gtests/net/tcp/receiver_window#
> ../../packetdrill/packetdrill rwin-rto-zero-window.pkt
>  00:01:01.370344 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [S], seq 0, win 65535, options [mss
> 1000,nop,nop,sackOK,nop,wscale 6], length 0
>  00:00:00.000096 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [S.], seq 3847169154, ack 1, win 65535, options [mss
> 1460,nop,nop,sackOK,nop,wscale 14], length 0
>  00:00:00.100277 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 457, length 0
>  00:00:00.000090 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 1:2001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000006 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 2001:4001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000003 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 4001:6001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000002 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 6001:8001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000001 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 8001:10001, ack 1, win 4, length 2000: HTTP
>  00:00:00.209131 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 10001:11001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100190 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:00.203824 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100175 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:00.507835 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100192 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:01.115858 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100182 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:02.331747 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100198 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:04.955980 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100197 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:09.627985 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100179 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:19.355725 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100203 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:42.395633 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100202 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:01:17.724059 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100201 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:02:02.779516 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100229 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:02:02.779828 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100230 ?     In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1001, win 1000, length 0
>  00:00:00.000034 ?     Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 11001:12001, ack 1, win 4, length 1000: HTTP
>  00:00:00.000005 ?     Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 12001:13001, ack 1, win 4, length 1000: HTTP
>
> rwin-rto-zero-window.pkt:62: error handling packet: live packet field
> tcp_psh: expected: 1 (0x1) vs actual: 0 (0x0)
> script packet: 405.390244 P. 1001:3001(2000) ack 1
> actual packet: 405.390237 . 11001:13001(2000) ack 1 win 4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28 17:15               ` Neal Cardwell
@ 2023-07-31  3:41                 ` Menglong Dong
  2023-07-31  9:49                 ` Menglong Dong
  1 sibling, 0 replies; 17+ messages in thread
From: Menglong Dong @ 2023-07-31  3:41 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Sat, Jul 29, 2023 at 1:15 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Jul 28, 2023 at 7:25 AM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > > > > > > >
> > > > > > > > From: Menglong Dong <imagedong@tencent.com>
> > > > > > > >
> > > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > > right all the time.
> > > > > > > >
> > > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > > >
> > > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > > >
> > > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > > >
> > > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > > which is exact the timestamp of the timeout.
> > > > > > > >
> > > > > > > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > > > > > > ---
> > > > > > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > > >                                             tp->snd_una, tp->snd_nxt);
> > > > > > > >                 }
> > > > > > > >  #endif
> > > > > > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > +               /* It's a little rough here, we regard any valid packet that
> > > > > > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > > +                * packet.
> > > > > > > > +                */
> > > > > > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > >                         tcp_write_err(sk);
> > > > > > > >                         goto out;
> > > > > > > >                 }
>
> One potential pre-existing issue with this logic: if the connection is
> restarting from idle, then tp->rcv_tstamp could already be a long time
> (minutes or hours) in the past even on the first RTO, in which case
> the very first RTO that found a zero tp->snd_wnd  would find this
> check returns true, and would destroy the connection immediately. This
> seems extremely brittle.

Yes, this scenario can happen and cause the connection
break unexpectedly.

>
> AFAICT it would be safer to replace this logic with a call to the
> standard tcp_write_timeout() logic that has a more robust check to see
> if the connection should be destroyed.

Yes, we need a more robust check here. But I think tcp_write_timeout()
is not a good choice. The icsk->icsk_retransmits won't increase and
can keep being 0 in this scenario, which makes tcp_write_timeout()
always return 0.

Enn...let me think again.

>
> neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28 14:25             ` Neal Cardwell
  2023-07-28 17:15               ` Neal Cardwell
  2023-07-31  3:28               ` Menglong Dong
@ 2023-07-31  8:24               ` Menglong Dong
  2 siblings, 0 replies; 17+ messages in thread
From: Menglong Dong @ 2023-07-31  8:24 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Eric Dumazet, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	Menglong Dong, Yuchung Cheng, Soheil Hassas Yeganeh

On Fri, Jul 28, 2023 at 10:25 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <edumazet@google.com> wrote:
[...]
>
> In that packetdrill case AFAICT that is the ZWP timer firing, and the
> sender sends a ZWP.
>
> I think maybe Menglong is looking more at something like the following
> scenario, where at the time the RTO timer fires the data sender finds
> the tp->snd_wnd is zero, so it sends a retransmit of the
> lowest-sequence data packet.
>
> Here is a packetdrill case and the tcpdump trace on an upstream
> net-next kernel... I have not worked out all the details at the end,
> but perhaps it can help move the discussion forward:
>
>
> ~/packetdrill/gtests/net/tcp/receiver_window# cat rwin-rto-zero-window.pkt
> // Test how sender reacts to unexpected arrival rwin of 0.
>
> `../common/defaults.sh`
>
> // Create a socket.
>     0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>    +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>    +0 bind(3, ..., ...) = 0
>    +0 listen(3, 1) = 0
>
> // Establish a connection.
>   +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
>    +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
>   +.1 < . 1:1(0) ack 1 win 457
>    +0 accept(3, ..., ...) = 4
>
>    +0 write(4, ..., 20000) = 20000
>    +0 > P. 1:10001(10000) ack 1
>
> // TLP
>   +.2 > . 10001:11001(1000) ack 1
> // Receiver has retracted rwin to 0
> // (perhaps from the 2023 proposed OOM code?).
>   +.1 < . 1:1(0) ack 1 win 0
>
> // RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
> // so we take the special f (!tp->snd_wnd...) code path.
>   +.2 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>   +.5 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +1.2 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +2.4 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +4.8 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
>  +9.6 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +19.2 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +38.4 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +76.8 > . 1:1001(1000) ack 1
>   +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
>  +.1 < . 1:1(0) ack 1 win 0
>
> +120 > . 1:1001(1000) ack 1
>  +.1 < . 1:1(0) ack 1001 win 1000
>
> // Received non-zero window update. Send more data.
>   +0 > P. 1001:3001(2000) ack 1
>  +.1 < . 1:1(0) ack 3001 win 1000
>
> ----------
> When I run that script on a net-next kernel I see the rounding up of
> the RTO to 122 secs rather than 120 secs, but for whatever reason the
> script does not cause the socket to die early...
>

I think I know the reason now. Without the 2nd patches that I send
in this series, the ACK can't update the rwin to 0, as it will be ignored
in tcp_may_update_window().

However, you can send an ACK that acknowledges the new data
to update the rwin to 0. I modified your script, and it can die as we
excepted:

// Test how sender reacts to unexpected arrival rwin of 0.

`../common/defaults.sh`

// Create a socket.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

// Establish a connection.
  +.1 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
   +0 > S. 0:0(0) ack 1 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
  +.1 < . 1:1(0) ack 1 win 457
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 20000) = 20000
   +0 > P. 1:10001(10000) ack 1

// Update the window to 0. "ack 0 win 0" won't update the window, as it
// will be ignored by tcp_may_update_window()
  +.1 < . 1:1(0) ack 1001 win 0

// RTO, and in tcp_retransmit_timer() we see the receiver window is zero,
// so we take the special f (!tp->snd_wnd...) code path.
  +.2 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

  +.5 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

 +1.2 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

 +2.4 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

 +4.8 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

 +9.6 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

+19.2 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

+38.4 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

+76.8 > . 1001:2001(1000) ack 1
  +.1 < . 1:1(0) ack 1001 win 0

// socket will die in tcp_retransmit_timer() in the
// "tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX" code path.
// Following retransmit won't happen.
+120 > . 1001:2001(1000) ack 1
 +.1 < . 1:1(0) ack 1001 win 0
------------------------------------------------------------------------------

I don't know how to check the die of socket with
packetdrill, so I checked it by:
  ss -nitme | grep 8080 | grep on
And I can see the socket die after timeout of the 120seconds
timer.

$ packetdrill ./rwin-rto-zero-window.pkt
./rwin-rto-zero-window.pkt:55: error handling packet: Timed out
waiting for packet


> The tcpdump trace:
>
>  tcpdump -ttt -n -i any port 8080 &
>
> ->
>
> ~/packetdrill/gtests/net/tcp/receiver_window#
> ../../packetdrill/packetdrill rwin-rto-zero-window.pkt
>  00:01:01.370344 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [S], seq 0, win 65535, options [mss
> 1000,nop,nop,sackOK,nop,wscale 6], length 0
>  00:00:00.000096 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [S.], seq 3847169154, ack 1, win 65535, options [mss
> 1460,nop,nop,sackOK,nop,wscale 14], length 0
>  00:00:00.100277 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 457, length 0
>  00:00:00.000090 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 1:2001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000006 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 2001:4001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000003 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 4001:6001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000002 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 6001:8001, ack 1, win 4, length 2000: HTTP
>  00:00:00.000001 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [P.], seq 8001:10001, ack 1, win 4, length 2000: HTTP
>  00:00:00.209131 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 10001:11001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100190 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:00.203824 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100175 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:00.507835 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100192 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:01.115858 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100182 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:02.331747 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100198 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:04.955980 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100197 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:09.627985 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100179 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:19.355725 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100203 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:00:42.395633 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100202 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:01:17.724059 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100201 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:02:02.779516 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100229 tun0  In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1, win 0, length 0
>  00:02:02.779828 tun0  Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 1:1001, ack 1, win 4, length 1000: HTTP
>  00:00:00.100230 ?     In  IP 192.0.2.1.51231 > 192.168.56.132.8080:
> Flags [.], ack 1001, win 1000, length 0
>  00:00:00.000034 ?     Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 11001:12001, ack 1, win 4, length 1000: HTTP
>  00:00:00.000005 ?     Out IP 192.168.56.132.8080 > 192.0.2.1.51231:
> Flags [.], seq 12001:13001, ack 1, win 4, length 1000: HTTP
>
> rwin-rto-zero-window.pkt:62: error handling packet: live packet field
> tcp_psh: expected: 1 (0x1) vs actual: 0 (0x0)
> script packet: 405.390244 P. 1001:3001(2000) ack 1
> actual packet: 405.390237 . 11001:13001(2000) ack 1 win 4

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer()
  2023-07-28 17:15               ` Neal Cardwell
  2023-07-31  3:41                 ` Menglong Dong
@ 2023-07-31  9:49                 ` Menglong Dong
  1 sibling, 0 replies; 17+ messages in thread
From: Menglong Dong @ 2023-07-31  9:49 UTC (permalink / raw)
  To: Neal Cardwell, Eric Dumazet
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, Menglong Dong,
	Yuchung Cheng, Soheil Hassas Yeganeh

On Sat, Jul 29, 2023 at 1:15 AM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Jul 28, 2023 at 7:25 AM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Fri, Jul 28, 2023 at 1:50 AM Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > On Fri, Jul 28, 2023 at 8:25 AM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 28, 2023 at 12:44 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Thu, Jul 27, 2023 at 7:57 PM Menglong Dong <menglong8.dong@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jul 28, 2023 at 3:31 AM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Jul 27, 2023 at 2:52 PM <menglong8.dong@gmail.com> wrote:
> > > > > > > >
> > > > > > > > From: Menglong Dong <imagedong@tencent.com>
> > > > > > > >
> > > > > > > > In tcp_retransmit_timer(), a window shrunk connection will be regarded
> > > > > > > > as timeout if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'. This is not
> > > > > > > > right all the time.
> > > > > > > >
> > > > > > > > The retransmits will become zero-window probes in tcp_retransmit_timer()
> > > > > > > > if the 'snd_wnd==0'. Therefore, the icsk->icsk_rto will come up to
> > > > > > > > TCP_RTO_MAX sooner or later.
> > > > > > > >
> > > > > > > > However, the timer is not precise enough, as it base on timer wheel.
> > > > > > > > Sorry that I am not good at timer, but I know the concept of time-wheel.
> > > > > > > > The longer of the timer, the rougher it will be. So the timeout is not
> > > > > > > > triggered after TCP_RTO_MAX, but 122877ms as I tested.
> > > > > > > >
> > > > > > > > Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> > > > > > > > once the RTO come up to TCP_RTO_MAX.
> > > > > > > >
> > > > > > > > Fix this by replacing the 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> > > > > > > > which is exact the timestamp of the timeout.
> > > > > > > >
> > > > > > > > Signed-off-by: Menglong Dong <imagedong@tencent.com>
> > > > > > > > ---
> > > > > > > >  net/ipv4/tcp_timer.c | 6 +++++-
> > > > > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > > > > > > > index 470f581eedd4..3a20db15a186 100644
> > > > > > > > --- a/net/ipv4/tcp_timer.c
> > > > > > > > +++ b/net/ipv4/tcp_timer.c
> > > > > > > > @@ -511,7 +511,11 @@ void tcp_retransmit_timer(struct sock *sk)
> > > > > > > >                                             tp->snd_una, tp->snd_nxt);
> > > > > > > >                 }
> > > > > > > >  #endif
> > > > > > > > -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > > +               /* It's a little rough here, we regard any valid packet that
> > > > > > > > +                * update tp->rcv_tstamp as the reply of the retransmitted
> > > > > > > > +                * packet.
> > > > > > > > +                */
> > > > > > > > +               if ((u32)icsk->icsk_timeout - tp->rcv_tstamp > TCP_RTO_MAX) {
> > > > > > > >                         tcp_write_err(sk);
> > > > > > > >                         goto out;
> > > > > > > >                 }
>
> One potential pre-existing issue with this logic: if the connection is
> restarting from idle, then tp->rcv_tstamp could already be a long time
> (minutes or hours) in the past even on the first RTO, in which case
> the very first RTO that found a zero tp->snd_wnd  would find this
> check returns true, and would destroy the connection immediately. This
> seems extremely brittle.
>
> AFAICT it would be safer to replace this logic with a call to the
> standard tcp_write_timeout() logic that has a more robust check to see
> if the connection should be destroyed.

How about we check it with:

icsk->icsk_timeout - max(tp->retrans_stamp, tp->rcv_tstamp) > TCP_RTO_MAX

?

Therefore, we don't need to do a lot of modification. I have to
say that the way we check here is rough, and a simple loss
of the retransmitted packet or the ACK can cause the die
of the socket.

Maybe we need a more perfect way here? which may introduce
a certain amount of modification.

BTW, should I split out this patch from this series?

>
> neal

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2023-07-31  9:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-27 12:51 [PATCH net-next 0/3] net: tcp: support probing OOM menglong8.dong
2023-07-27 12:51 ` [PATCH net-next 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
2023-07-27 19:17   ` Eric Dumazet
2023-07-28  2:37     ` Menglong Dong
2023-07-27 12:51 ` [PATCH net-next 2/3] net: tcp: allow zero-window ACK update the window menglong8.dong
2023-07-27 12:51 ` [PATCH net-next 3/3] net: tcp: check timeout by icsk->icsk_timeout in tcp_retransmit_timer() menglong8.dong
2023-07-27 19:31   ` Eric Dumazet
2023-07-28  2:57     ` Menglong Dong
2023-07-28  4:44       ` Neal Cardwell
2023-07-28  6:25         ` Menglong Dong
2023-07-28  8:50           ` Eric Dumazet
2023-07-28 14:25             ` Neal Cardwell
2023-07-28 17:15               ` Neal Cardwell
2023-07-31  3:41                 ` Menglong Dong
2023-07-31  9:49                 ` Menglong Dong
2023-07-31  3:28               ` Menglong Dong
2023-07-31  8:24               ` Menglong Dong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).