* [PATCH net-next v3 0/3] net: tcp: support probing OOM
@ 2023-08-08 11:58 menglong8.dong
  2023-08-08 11:58 ` [PATCH net-next v3 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: menglong8.dong @ 2023-08-08 11:58 UTC (permalink / raw)
  To: edumazet, ncardwell
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, flyingpeng,
	Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

This series makes a few small changes so that TCP retransmissions become
zero-window probes when the receiver drops an skb because of memory
pressure.

In the 1st patch, we reply with a zero-window ACK if the skb is dropped
because the receiver is out of memory, instead of dropping it silently.

In the 2nd patch, we allow a zero-window ACK to update the window.

In the 3rd patch, we fix an unexpected socket death when snd_wnd is 0 in
tcp_retransmit_timer().

After these changes, TCP can keep probing an out-of-memory receiver
indefinitely.

Changes since v2:
- refactor the code to avoid code duplication in the 1st patch
- use after() instead of max() in tcp_rtx_probe0_timed_out()

Changes since v1:
- send 0 rwin ACK for the receive queue empty case when necessary in the
  1st patch
- send the ACK immediately by using the ICSK_ACK_NOW flag in the 1st
  patch
- consider the case of a connection restarting from idle, as Neal
  commented, in the 3rd patch

Menglong Dong (3):
  net: tcp: send zero-window ACK when no memory
  net: tcp: allow zero-window ACK update the window
  net: tcp: fix unexcepted socket die when snd_wnd is 0

 include/net/inet_connection_sock.h |  3 ++-
 net/ipv4/tcp_input.c               | 20 +++++++++++++-------
 net/ipv4/tcp_output.c              | 14 +++++++++++---
 net/ipv4/tcp_timer.c               | 14 +++++++++++++-
 4 files changed, 39 insertions(+), 12 deletions(-)

-- 
2.40.1



* [PATCH net-next v3 1/3] net: tcp: send zero-window ACK when no memory
  2023-08-08 11:58 [PATCH net-next v3 0/3] net: tcp: support probing OOM menglong8.dong
@ 2023-08-08 11:58 ` menglong8.dong
  2023-08-08 11:58 ` [PATCH net-next v3 2/3] net: tcp: allow zero-window ACK update the window menglong8.dong
  2023-08-08 11:58 ` [PATCH net-next v3 3/3] net: tcp: fix unexcepted socket die when snd_wnd is 0 menglong8.dong
  2 siblings, 0 replies; 6+ messages in thread
From: menglong8.dong @ 2023-08-08 11:58 UTC (permalink / raw)
  To: edumazet, ncardwell
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, flyingpeng,
	Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

Currently, an skb is dropped when the receiver has no memory, which makes
the client keep retransmitting until it times out. This is unfriendly to
users.

In this patch, we reply with a zero-window ACK in this case to update
the snd_wnd of the sender to 0. Therefore, the sender won't time out the
connection and will probe the zero window with its retransmits.
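
The receive-side behaviour described above can be sketched as a small
userspace model. This is only an illustration under simplifying
assumptions: struct rx_model, rx_data() and select_window() are
hypothetical names, memory accounting is reduced to a single byte counter,
and only the two flag values are taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values mirror the patch; everything else is a simplified,
 * hypothetical userspace model and not kernel code.
 */
enum ack_state {
	ACK_NOW   = 16,	/* send the next ACK immediately */
	ACK_NOMEM = 32,	/* last segment hit a memory shortage */
};

struct rx_model {
	uint32_t pending;	/* pending ACK flags */
	uint32_t queued;	/* segments in the receive queue */
	uint32_t free_space;	/* receive memory still available, in bytes */
};

/* Returns true if the segment was queued, false if it was dropped. */
static bool rx_data(struct rx_model *rx, uint32_t truesize)
{
	if (truesize > rx->free_space) {
		/* Out of memory: request an immediate zero-window ACK. */
		rx->pending |= ACK_NOMEM | ACK_NOW;
		if (rx->queued > 0)
			return false;	/* drop, the sender will probe */
		/* Empty queue: accept the segment anyway (forced charge). */
		rx->free_space = 0;
		rx->queued++;
		return true;
	}
	rx->free_space -= truesize;
	rx->queued++;
	return true;
}

/* Window to advertise in the next ACK. */
static uint32_t select_window(const struct rx_model *rx)
{
	if (rx->pending & ACK_NOMEM)
		return 0;	/* temporary, not stored anywhere */
	return rx->free_space;
}
```

In this model a segment that does not fit always results in a zero window
being advertised, whether or not the segment itself was dropped.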

Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
v3:
- refactor the code to avoid code duplication
v2:
- send 0 rwin ACK for the receive queue empty case when necessary
- send the ACK immediately by using the ICSK_ACK_NOW flag
---
 include/net/inet_connection_sock.h |  3 ++-
 net/ipv4/tcp_input.c               | 18 ++++++++++++------
 net/ipv4/tcp_output.c              | 14 +++++++++++---
 3 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index c2b15f7e5516..be3c858a2ebb 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -164,7 +164,8 @@ enum inet_csk_ack_state_t {
 	ICSK_ACK_TIMER  = 2,
 	ICSK_ACK_PUSHED = 4,
 	ICSK_ACK_PUSHED2 = 8,
-	ICSK_ACK_NOW = 16	/* Send the next ACK immediately (once) */
+	ICSK_ACK_NOW = 16,	/* Send the next ACK immediately (once) */
+	ICSK_ACK_NOMEM = 32,
 };
 
 void inet_csk_init_xmit_timers(struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 8e96ebe373d7..2ac059483410 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5059,13 +5059,19 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 
 		/* Ok. In sequence. In window. */
 queue_and_out:
-		if (skb_queue_len(&sk->sk_receive_queue) == 0)
-			sk_forced_mem_schedule(sk, skb->truesize);
-		else if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
-			reason = SKB_DROP_REASON_PROTO_MEM;
-			NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
+		if (tcp_try_rmem_schedule(sk, skb, skb->truesize)) {
+			/* TODO: maybe ratelimit these WIN 0 ACK ? */
+			inet_csk(sk)->icsk_ack.pending |=
+					(ICSK_ACK_NOMEM | ICSK_ACK_NOW);
+			inet_csk_schedule_ack(sk);
 			sk->sk_data_ready(sk);
-			goto drop;
+
+			if (skb_queue_len(&sk->sk_receive_queue)) {
+				reason = SKB_DROP_REASON_PROTO_MEM;
+				NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRCVQDROP);
+				goto drop;
+			}
+			sk_forced_mem_schedule(sk, skb->truesize);
 		}
 
 		eaten = tcp_queue_rcv(sk, skb, &fragstolen);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c5412ee77fc8..769a558159ee 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -257,11 +257,19 @@ EXPORT_SYMBOL(tcp_select_initial_window);
 static u16 tcp_select_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 old_win = tp->rcv_wnd;
-	u32 cur_win = tcp_receive_window(tp);
-	u32 new_win = __tcp_select_window(sk);
 	struct net *net = sock_net(sk);
+	u32 old_win = tp->rcv_wnd;
+	u32 cur_win, new_win;
+
+	/* Make the window 0 if we failed to queue the data because we
+	 * are out of memory. The window is temporary, so we don't store
+	 * it on the socket.
+	 */
+	if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+		return 0;
 
+	cur_win = tcp_receive_window(tp);
+	new_win = __tcp_select_window(sk);
 	if (new_win < cur_win) {
 		/* Danger Will Robinson!
 		 * Don't update rcv_wup/rcv_wnd here or else
-- 
2.40.1



* [PATCH net-next v3 2/3] net: tcp: allow zero-window ACK update the window
  2023-08-08 11:58 [PATCH net-next v3 0/3] net: tcp: support probing OOM menglong8.dong
  2023-08-08 11:58 ` [PATCH net-next v3 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
@ 2023-08-08 11:58 ` menglong8.dong
  2023-08-08 11:58 ` [PATCH net-next v3 3/3] net: tcp: fix unexcepted socket die when snd_wnd is 0 menglong8.dong
  2 siblings, 0 replies; 6+ messages in thread
From: menglong8.dong @ 2023-08-08 11:58 UTC (permalink / raw)
  To: edumazet, ncardwell
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, flyingpeng,
	Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

For now, according to tcp_may_update_window(), an ACK can update the
window in the following cases:

1. the ACK acknowledges new data
2. the ACK has new data
3. the ACK expands the window and its seq is valid

Now, we also allow the ACK to update the window if the window is 0 and
its seq/ack are valid. This covers the case where the receiver replies
with a zero-window ACK when it is under memory stress and can't queue the
new data.
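
As a rough illustration of the rule above, here is a standalone userspace
sketch. struct snd_state, seq_after() and may_update_window() are
hypothetical stand-ins for the kernel's tcp_sock fields, after() and
tcp_may_update_window(); only the boolean expression follows the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is after b" test on 32-bit sequence numbers. */
static bool seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Hypothetical subset of the sender state used by the check. */
struct snd_state {
	uint32_t snd_una;	/* first unacknowledged byte */
	uint32_t snd_wl1;	/* seq of the segment used for the last window update */
	uint32_t snd_wnd;	/* current send window */
};

static bool may_update_window(const struct snd_state *tp, uint32_t ack,
			      uint32_t ack_seq, uint32_t nwin)
{
	return seq_after(ack, tp->snd_una) ||		/* case 1 */
	       seq_after(ack_seq, tp->snd_wl1) ||	/* case 2 */
	       (ack_seq == tp->snd_wl1 &&		/* case 3, plus nwin == 0 */
		(nwin > tp->snd_wnd || nwin == 0));
}
```

With snd_una = 1000, snd_wl1 = 500 and snd_wnd = 65535, a pure ACK
(ack = 1000, ack_seq = 500) that shrinks the window to 30000 is still
ignored, while the same ACK advertising a window of 0 is now accepted.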

Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2ac059483410..d34d52fdfdb1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3525,7 +3525,7 @@ static inline bool tcp_may_update_window(const struct tcp_sock *tp,
 {
 	return	after(ack, tp->snd_una) ||
 		after(ack_seq, tp->snd_wl1) ||
-		(ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd);
+		(ack_seq == tp->snd_wl1 && (nwin > tp->snd_wnd || !nwin));
 }
 
 /* If we update tp->snd_una, also update tp->bytes_acked */
-- 
2.40.1



* [PATCH net-next v3 3/3] net: tcp: fix unexcepted socket die when snd_wnd is 0
  2023-08-08 11:58 [PATCH net-next v3 0/3] net: tcp: support probing OOM menglong8.dong
  2023-08-08 11:58 ` [PATCH net-next v3 1/3] net: tcp: send zero-window ACK when no memory menglong8.dong
  2023-08-08 11:58 ` [PATCH net-next v3 2/3] net: tcp: allow zero-window ACK update the window menglong8.dong
@ 2023-08-08 11:58 ` menglong8.dong
  2023-08-08 12:49   ` Eric Dumazet
  2 siblings, 1 reply; 6+ messages in thread
From: menglong8.dong @ 2023-08-08 11:58 UTC (permalink / raw)
  To: edumazet, ncardwell
  Cc: davem, kuba, pabeni, dsahern, netdev, linux-kernel, flyingpeng,
	Menglong Dong

From: Menglong Dong <imagedong@tencent.com>

In tcp_retransmit_timer(), a connection whose window has shrunk is
regarded as timed out if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'.
This is not always right.

The retransmits become zero-window probes in tcp_retransmit_timer() if
'snd_wnd==0'. Therefore, icsk->icsk_rto will reach TCP_RTO_MAX sooner or
later.

However, the timer can be delayed and fire after 122877ms rather than
TCP_RTO_MAX, as I observed in testing.

Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
once the RTO reaches TCP_RTO_MAX, and the socket will die.

Fix this by replacing 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
which is exactly the timestamp of the timeout. Meanwhile, use the later
of tp->retrans_stamp and tp->rcv_tstamp as the last-updated timestamp in
the receive path: since the connection can restart from idle,
tp->rcv_tstamp could already be a long time (minutes or hours) in the
past even on the first RTO.
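
The difference between the two checks can be illustrated with a userspace
sketch. It assumes one tick per millisecond (HZ_MODEL is a made-up
constant) and treats both timestamps as being in the same unit; the
function names are hypothetical and only the comparisons follow the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define HZ_MODEL	1000u			/* assumption: 1 tick == 1 ms */
#define RTO_MAX_MODEL	(120u * HZ_MODEL)	/* TCP_RTO_MAX is 120 seconds */

/* Wraparound-safe "a is after b" test on 32-bit timestamps. */
static bool ts_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Old check: compares against the time the handler actually runs. */
static bool old_timed_out(uint32_t now, uint32_t rcv_tstamp)
{
	return now - rcv_tstamp > RTO_MAX_MODEL;
}

/* New check: compares against the time the timer was armed for. */
static bool new_timed_out(uint32_t icsk_timeout, uint32_t retrans_stamp,
			  uint32_t rcv_tstamp)
{
	uint32_t last = ts_after(retrans_stamp, rcv_tstamp) ?
			retrans_stamp : rcv_tstamp;

	return ts_after(icsk_timeout, last + RTO_MAX_MODEL);
}
```

For example, if the last ACK arrived at tick 5000 and the timer was armed
for tick 125000 but the handler only ran at tick 127877, the old check
sees 122877 > 120000 and kills the socket, while the new check compares
125000 against 125000 and lets the connection keep probing.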

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
v3:
- use after() instead of max() in tcp_rtx_probe0_timed_out()
v2:
- consider the case of a connection restarting from idle, as Neal
  commented
---
 net/ipv4/tcp_timer.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index d45c96c7f5a4..f30d1467771c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -454,6 +454,18 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
 			  req->timeout << req->num_timeout, TCP_RTO_MAX);
 }
 
+static bool tcp_rtx_probe0_timed_out(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 timeout_ts, rtx_ts, rcv_ts;
+
+	rtx_ts = tp->retrans_stamp;
+	rcv_ts = tp->rcv_tstamp;
+	timeout_ts = after(rtx_ts, rcv_ts) ? rtx_ts : rcv_ts;
+	timeout_ts += TCP_RTO_MAX;
+
+	return after(inet_csk(sk)->icsk_timeout, timeout_ts);
+}
 
 /**
  *  tcp_retransmit_timer() - The TCP retransmit timeout handler
@@ -519,7 +531,7 @@ void tcp_retransmit_timer(struct sock *sk)
 					    tp->snd_una, tp->snd_nxt);
 		}
 #endif
-		if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
+		if (tcp_rtx_probe0_timed_out(sk)) {
 			tcp_write_err(sk);
 			goto out;
 		}
-- 
2.40.1



* Re: [PATCH net-next v3 3/3] net: tcp: fix unexcepted socket die when snd_wnd is 0
  2023-08-08 11:58 ` [PATCH net-next v3 3/3] net: tcp: fix unexcepted socket die when snd_wnd is 0 menglong8.dong
@ 2023-08-08 12:49   ` Eric Dumazet
  2023-08-08 13:07     ` Menglong Dong
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2023-08-08 12:49 UTC (permalink / raw)
  To: menglong8.dong
  Cc: ncardwell, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	flyingpeng, Menglong Dong

On Tue, Aug 8, 2023 at 1:59 PM <menglong8.dong@gmail.com> wrote:
>
> From: Menglong Dong <imagedong@tencent.com>
>
> In tcp_retransmit_timer(), a connection whose window has shrunk is
> regarded as timed out if 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX'.
> This is not always right.
>
> The retransmits become zero-window probes in tcp_retransmit_timer() if
> 'snd_wnd==0'. Therefore, icsk->icsk_rto will reach TCP_RTO_MAX sooner or
> later.
>
> However, the timer can be delayed and fire after 122877ms rather than
> TCP_RTO_MAX, as I observed in testing.
>
> Therefore, 'tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX' is always true
> once the RTO reaches TCP_RTO_MAX, and the socket will die.
>
> Fix this by replacing 'tcp_jiffies32' with '(u32)icsk->icsk_timeout',
> which is exactly the timestamp of the timeout. Meanwhile, use the later
> of tp->retrans_stamp and tp->rcv_tstamp as the last-updated timestamp in
> the receive path: since the connection can restart from idle,
> tp->rcv_tstamp could already be a long time (minutes or hours) in the
> past even on the first RTO.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Link: https://lore.kernel.org/netdev/CADxym3YyMiO+zMD4zj03YPM3FBi-1LHi6gSD2XT8pyAMM096pg@mail.gmail.com/
> Signed-off-by: Menglong Dong <imagedong@tencent.com>
> ---
> v3:
> - use after() instead of max() in tcp_rtx_probe0_timed_out()
> v2:
> - consider the case of a connection restarting from idle, as Neal
>   commented
> ---
>  net/ipv4/tcp_timer.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index d45c96c7f5a4..f30d1467771c 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -454,6 +454,18 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
>                           req->timeout << req->num_timeout, TCP_RTO_MAX);
>  }
>
> +static bool tcp_rtx_probe0_timed_out(struct sock *sk)

const struct sock *sk

> +{
> +       struct tcp_sock *tp = tcp_sk(sk);

const struct tcp_sock *tp = tcp_sk(sk);

> +       u32 timeout_ts, rtx_ts, rcv_ts;
> +
> +       rtx_ts = tp->retrans_stamp;
> +       rcv_ts = tp->rcv_tstamp;
> +       timeout_ts = after(rtx_ts, rcv_ts) ? rtx_ts : rcv_ts;
> +       timeout_ts += TCP_RTO_MAX;

If we are concerned with a socket dying too soon, I would suggest
adding 2*TCP_RTO_MAX instead of TCP_RTO_MAX

When a receiver is OOMing, it is possible the ACK RWIN 0 cannot be sent
at all, so tp->rcv_tstamp will not be refreshed. Or the ACK could be lost
in the network path.
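
A rough way to see the effect of the larger grace period is the following
userspace sketch. It assumes one tick per millisecond and that probes are
sent exactly every TCP_RTO_MAX once the RTO has backed off fully;
probe0_timed_out() and the constants are hypothetical names used only for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define HZ_MODEL	1000u			/* assumption: 1 tick == 1 ms */
#define RTO_MAX_MODEL	(120u * HZ_MODEL)	/* TCP_RTO_MAX is 120 seconds */

/* Wraparound-safe "a is after b" test on 32-bit timestamps. */
static bool ts_after(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* The probe timeout check with a configurable grace period. */
static bool probe0_timed_out(uint32_t icsk_timeout, uint32_t last_update,
			     uint32_t grace)
{
	return ts_after(icsk_timeout, last_update + grace);
}
```

Suppose the last ACK arrived at tick 0 and the ACK for the probe sent at
tick 120000 was lost. When the next timer expires at tick 240000, a grace
period of TCP_RTO_MAX already declares the socket dead, whereas
2*TCP_RTO_MAX tolerates that single missing ACK and only gives up at the
following expiry.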

This also suggests the net_dbg_ratelimited("Peer %pI4:%u/%u
unexpectedly shrunk window %u:%u (repaired)\n"...) messages
are slightly wrong, because they could be printed even if we did not
receive a new ACK packet from the remote peer.

Perhaps we should change them to include delays (how long @skb stayed
in rtx queue, how old is the last ACK we received)

> +
> +       return after(inet_csk(sk)->icsk_timeout, timeout_ts);
> +}
>
>  /**
>   *  tcp_retransmit_timer() - The TCP retransmit timeout handler
> @@ -519,7 +531,7 @@ void tcp_retransmit_timer(struct sock *sk)
>                                             tp->snd_una, tp->snd_nxt);
>                 }
>  #endif
> -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
> +               if (tcp_rtx_probe0_timed_out(sk)) {
>                         tcp_write_err(sk);
>                         goto out;
>                 }
> --
> 2.40.1
>


* Re: [PATCH net-next v3 3/3] net: tcp: fix unexcepted socket die when snd_wnd is 0
  2023-08-08 12:49   ` Eric Dumazet
@ 2023-08-08 13:07     ` Menglong Dong
  0 siblings, 0 replies; 6+ messages in thread
From: Menglong Dong @ 2023-08-08 13:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: ncardwell, davem, kuba, pabeni, dsahern, netdev, linux-kernel,
	flyingpeng, Menglong Dong



> On Aug 8, 2023, at 20:49, Eric Dumazet <edumazet@google.com> wrote:
> 
> On Tue, Aug 8, 2023 at 1:59 PM <menglong8.dong@gmail.com> wrote:
>> 
[……]
>> 
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index d45c96c7f5a4..f30d1467771c 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -454,6 +454,18 @@ static void tcp_fastopen_synack_timer(struct sock *sk, struct request_sock *req)
>>                          req->timeout << req->num_timeout, TCP_RTO_MAX);
>> }
>> 
>> +static bool tcp_rtx_probe0_timed_out(struct sock *sk)
> 
> const struct sock *sk
> 
>> +{
>> +       struct tcp_sock *tp = tcp_sk(sk);
> 
> const struct tcp_sock *tp = tcp_sk(sk);
> 
>> +       u32 timeout_ts, rtx_ts, rcv_ts;
>> +
>> +       rtx_ts = tp->retrans_stamp;
>> +       rcv_ts = tp->rcv_tstamp;
>> +       timeout_ts = after(rtx_ts, rcv_ts) ? rtx_ts : rcv_ts;
>> +       timeout_ts += TCP_RTO_MAX;
> 
> If we are concerned with a socket dying too soon, I would suggest
> adding 2*TCP_RTO_MAX instead of TCP_RTO_MAX
> 
> When a receiver is OOMing, it is possible the ACK RWIN 0 cannot be sent
> at all, so tp->rcv_tstamp will not be refreshed. Or the ACK could be lost
> in the network path.

Yeah, I'm concerned about this too. I introduced the function
tcp_rtx_probe0_timed_out() here to offer a more reliable way to check the
timeout in the future. For now, we can fix the problem first by adding
2*TCP_RTO_MAX instead of TCP_RTO_MAX, as you advised.

> This also suggests the net_dbg_ratelimited("Peer %pI4:%u/%u
> unexpectedly shrunk window %u:%u (repaired)\n"...) messages
> are slightly wrong, because they could be printed even if we did not
> receive a new ACK packet from the remote peer.
> 
> Perhaps we should change them to include delays (how long @skb stayed
> in rtx queue, how old is the last ACK we received)

Sounds great, that would be more valuable. I'll change them
in the next version.

Thanks!
Menglong Dong

>> +
>> +       return after(inet_csk(sk)->icsk_timeout, timeout_ts);
>> +}
>> 
>> /**
>>  *  tcp_retransmit_timer() - The TCP retransmit timeout handler
>> @@ -519,7 +531,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>                                            tp->snd_una, tp->snd_nxt);
>>                }
>> #endif
>> -               if (tcp_jiffies32 - tp->rcv_tstamp > TCP_RTO_MAX) {
>> +               if (tcp_rtx_probe0_timed_out(sk)) {
>>                        tcp_write_err(sk);
>>                        goto out;
>>                }
>> --
>> 2.40.1



