[PATCH net-next 0/8] tcp: receiver changes


* [PATCH net-next 0/8] tcp: receiver changes
@ 2025-07-11 11:39 Eric Dumazet
  2025-07-11 11:39 ` [PATCH net-next 1/8] tcp: do not accept packets beyond window Eric Dumazet
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Before accepting an incoming packet:

- Do not accept a packet beyond the advertised RWIN.
  When such a packet is rejected, increment a new SNMP counter
  (LINUX_MIB_BEYOND_WINDOW).

- Out-of-order (ooo) packets should update rcv_mss and tp->scaling_ratio.

- Do not accept a packet that would exceed the sk_rcvbuf limit.

This series includes three associated packetdrill tests.

Eric Dumazet (8):
  tcp: do not accept packets beyond window
  tcp: add LINUX_MIB_BEYOND_WINDOW
  selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt
  tcp: call tcp_measure_rcv_mss() for ooo packets
  selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt
  tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb
  tcp: stronger sk_rcvbuf checks
  selftests/net: packetdrill: add tcp_rcv_toobig.pkt

 .../networking/net_cachelines/snmp.rst        |  1 +
 include/net/dropreason-core.h                 |  9 +++-
 include/net/sock.h                            |  2 +-
 include/uapi/linux/snmp.h                     |  1 +
 net/ipv4/proc.c                               |  1 +
 net/ipv4/tcp_input.c                          | 48 ++++++++++++++-----
 .../net/packetdrill/tcp_ooo_rcv_mss.pkt       | 27 +++++++++++
 .../net/packetdrill/tcp_rcv_big_endseq.pkt    | 44 +++++++++++++++++
 .../net/packetdrill/tcp_rcv_toobig.pkt        | 33 +++++++++++++
 9 files changed, 152 insertions(+), 14 deletions(-)
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_ooo_rcv_mss.pkt
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt

-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 1/8] tcp: do not accept packets beyond window
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
@ 2025-07-11 11:39 ` Eric Dumazet
  2025-07-12 20:52   ` Kuniyuki Iwashima
  2025-07-15  1:38   ` Jakub Kicinski
  2025-07-11 11:40 ` [PATCH net-next 2/8] tcp: add LINUX_MIB_BEYOND_WINDOW Eric Dumazet
                   ` (8 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:39 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Currently, TCP accepts incoming packets which might go beyond the
offered RWIN.

Add to tcp_sequence() the validation of packet end sequence.

Add the corresponding check in the fast path.

We relax this new constraint if the receive queue is empty,
to not freeze flows from buggy peers.

Add a new drop reason: SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE.
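
For readers who want to experiment outside the kernel, the decision order
described above can be approximated in user space. This is only a sketch:
struct rx_state, enum seq_verdict and check_sequence() are hypothetical
names, and the window field stands in for whatever tcp_receive_window(tp)
would return. The values in the examples are taken from the
tcp_rcv_big_endseq.pkt scenario (rcv_nxt 4001, window 5000).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe comparisons on 32-bit sequence numbers, in the style of
 * the kernel's before()/after() helpers.
 */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}

/* Hypothetical names standing in for the skb drop reasons. */
enum seq_verdict {
	SEQ_ACCEPT,
	SEQ_DROP_OLD,
	SEQ_DROP_INVALID_SEQ,
	SEQ_DROP_INVALID_END_SEQ,
};

/* Hypothetical flattened receiver state; not a kernel structure. */
struct rx_state {
	uint32_t rcv_wup;	/* rcv_nxt at the last window update */
	uint32_t rcv_nxt;	/* next expected sequence number */
	uint32_t window;	/* assumed result of tcp_receive_window(tp) */
	unsigned int queue_len;	/* packets waiting in the receive queue */
};

enum seq_verdict check_sequence(const struct rx_state *rx,
				uint32_t seq, uint32_t end_seq)
{
	uint32_t right_edge = rx->rcv_nxt + rx->window;

	if (seq_before(end_seq, rx->rcv_wup))
		return SEQ_DROP_OLD;

	if (seq_after(end_seq, right_edge)) {
		/* The packet starts beyond the window: always rejected. */
		if (seq_after(seq, right_edge))
			return SEQ_DROP_INVALID_SEQ;

		/* The packet only ends beyond the window: tolerated when
		 * nothing is queued, so that a buggy peer is not frozen.
		 */
		if (rx->queue_len)
			return SEQ_DROP_INVALID_END_SEQ;
	}
	return SEQ_ACCEPT;
}

void beyond_window_examples(void)
{
	struct rx_state rx = {
		.rcv_wup = 4001, .rcv_nxt = 4001,
		.window = 5000, .queue_len = 1,
	};

	assert(check_sequence(&rx, 4001, 54001) == SEQ_DROP_INVALID_END_SEQ);
	assert(check_sequence(&rx, 70001, 80001) == SEQ_DROP_INVALID_SEQ);
	assert(check_sequence(&rx, 1, 3001) == SEQ_DROP_OLD);

	/* Once the application has drained the queue, the same oversized
	 * packet is accepted.
	 */
	rx.queue_len = 0;
	assert(check_sequence(&rx, 4001, 54001) == SEQ_ACCEPT);
}
```

Calling beyond_window_examples() from a test harness exercises the three
drop cases and the empty-queue exception. The sketch does not model the
fast path in tcp_rcv_established(), which only redirects such packets to
the slow path.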

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/dropreason-core.h |  7 ++++++-
 net/ipv4/tcp_input.c          | 22 +++++++++++++++++-----
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
index b9e78290269e6e7d9d9155171f6b0ef03c7697c9..d88ff9a75d15fe60a961332a7eb4be94c5c7c3ec 100644
--- a/include/net/dropreason-core.h
+++ b/include/net/dropreason-core.h
@@ -45,6 +45,7 @@
 	FN(TCP_LISTEN_OVERFLOW)		\
 	FN(TCP_OLD_SEQUENCE)		\
 	FN(TCP_INVALID_SEQUENCE)	\
+	FN(TCP_INVALID_END_SEQUENCE)	\
 	FN(TCP_INVALID_ACK_SEQUENCE)	\
 	FN(TCP_RESET)			\
 	FN(TCP_INVALID_SYN)		\
@@ -303,8 +304,12 @@ enum skb_drop_reason {
 	SKB_DROP_REASON_TCP_LISTEN_OVERFLOW,
 	/** @SKB_DROP_REASON_TCP_OLD_SEQUENCE: Old SEQ field (duplicate packet) */
 	SKB_DROP_REASON_TCP_OLD_SEQUENCE,
-	/** @SKB_DROP_REASON_TCP_INVALID_SEQUENCE: Not acceptable SEQ field */
+	/** @SKB_DROP_REASON_TCP_INVALID_SEQUENCE: Not acceptable SEQ field.
+	 */
 	SKB_DROP_REASON_TCP_INVALID_SEQUENCE,
+	/** @SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE: Not acceptable END_SEQ field.
+	 */
+	SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE,
 	/**
 	 * @SKB_DROP_REASON_TCP_INVALID_ACK_SEQUENCE: Not acceptable ACK SEQ
 	 * field because ack sequence is not in the window between snd_una
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9b03c44c12b862b5d33f4390cfc85e2f8897cd8e..f0f9c78654b449cb2a122e8c53fdcc96e5317de7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4391,14 +4391,22 @@ static enum skb_drop_reason tcp_disordered_ack_check(const struct sock *sk,
  * (borrowed from freebsd)
  */
 
-static enum skb_drop_reason tcp_sequence(const struct tcp_sock *tp,
+static enum skb_drop_reason tcp_sequence(const struct sock *sk,
 					 u32 seq, u32 end_seq)
 {
+	const struct tcp_sock *tp = tcp_sk(sk);
+
 	if (before(end_seq, tp->rcv_wup))
 		return SKB_DROP_REASON_TCP_OLD_SEQUENCE;
 
-	if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
-		return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
+	if (after(end_seq, tp->rcv_nxt + tcp_receive_window(tp))) {
+		if (after(seq, tp->rcv_nxt + tcp_receive_window(tp)))
+			return SKB_DROP_REASON_TCP_INVALID_SEQUENCE;
+
+		/* Only accept this packet if receive queue is empty. */
+		if (skb_queue_len(&sk->sk_receive_queue))
+			return SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE;
+	}
 
 	return SKB_NOT_DROPPED_YET;
 }
@@ -5881,7 +5889,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 
 step1:
 	/* Step 1: check sequence number */
-	reason = tcp_sequence(tp, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+	reason = tcp_sequence(sk, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
 	if (reason) {
 		/* RFC793, page 37: "In all states except SYN-SENT, all reset
 		 * (RST) segments are validated by checking their SEQ-fields."
@@ -6110,6 +6118,10 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 			if (tcp_checksum_complete(skb))
 				goto csum_error;
 
+			if (after(TCP_SKB_CB(skb)->end_seq,
+				  tp->rcv_nxt + tcp_receive_window(tp)))
+				goto validate;
+
 			if ((int)skb->truesize > sk->sk_forward_alloc)
 				goto step5;
 
@@ -6165,7 +6177,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 	/*
 	 *	Standard slow path.
 	 */
-
+validate:
 	if (!tcp_validate_incoming(sk, skb, th, 1))
 		return;
 
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 2/8] tcp: add LINUX_MIB_BEYOND_WINDOW
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
  2025-07-11 11:39 ` [PATCH net-next 1/8] tcp: do not accept packets beyond window Eric Dumazet
@ 2025-07-11 11:40 ` Eric Dumazet
  2025-07-12 20:55   ` Kuniyuki Iwashima
  2025-07-11 11:40 ` [PATCH net-next 3/8] selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt Eric Dumazet
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Add a new SNMP MIB counter: LINUX_MIB_BEYOND_WINDOW.

It is incremented when an incoming packet is received beyond the
receiver window.

To read it:

nstat -az | grep TcpExtBeyondWindow

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 Documentation/networking/net_cachelines/snmp.rst | 1 +
 include/net/dropreason-core.h                    | 2 ++
 include/uapi/linux/snmp.h                        | 1 +
 net/ipv4/proc.c                                  | 1 +
 net/ipv4/tcp_input.c                             | 1 +
 5 files changed, 6 insertions(+)

diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
index bd44b3eebbef75352599883b9dde36e7889d4120..bce4eb35ec48112ec43d99c58351d3b646a708ec 100644
--- a/Documentation/networking/net_cachelines/snmp.rst
+++ b/Documentation/networking/net_cachelines/snmp.rst
@@ -36,6 +36,7 @@ unsigned_long  LINUX_MIB_TIMEWAITRECYCLED
 unsigned_long  LINUX_MIB_TIMEWAITKILLED
 unsigned_long  LINUX_MIB_PAWSACTIVEREJECTED
 unsigned_long  LINUX_MIB_PAWSESTABREJECTED
+unsigned_long  LINUX_MIB_BEYOND_WINDOW
 unsigned_long  LINUX_MIB_TSECR_REJECTED
 unsigned_long  LINUX_MIB_PAWS_OLD_ACK
 unsigned_long  LINUX_MIB_PAWS_TW_REJECTED
diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
index d88ff9a75d15fe60a961332a7eb4be94c5c7c3ec..6176e060541f330792014dd6081d1d0857536640 100644
--- a/include/net/dropreason-core.h
+++ b/include/net/dropreason-core.h
@@ -305,9 +305,11 @@ enum skb_drop_reason {
 	/** @SKB_DROP_REASON_TCP_OLD_SEQUENCE: Old SEQ field (duplicate packet) */
 	SKB_DROP_REASON_TCP_OLD_SEQUENCE,
 	/** @SKB_DROP_REASON_TCP_INVALID_SEQUENCE: Not acceptable SEQ field.
+	 * Corresponds to LINUX_MIB_BEYOND_WINDOW.
 	 */
 	SKB_DROP_REASON_TCP_INVALID_SEQUENCE,
 	/** @SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE: Not acceptable END_SEQ field.
+	 * Corresponds to LINUX_MIB_BEYOND_WINDOW.
 	 */
 	SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE,
 	/**
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 1d234d7e1892778c5ff04c240f8360608f391401..49f5640092a0df7ca2bfb01e87a627d9b1bc4233 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -186,6 +186,7 @@ enum
 	LINUX_MIB_TIMEWAITKILLED,		/* TimeWaitKilled */
 	LINUX_MIB_PAWSACTIVEREJECTED,		/* PAWSActiveRejected */
 	LINUX_MIB_PAWSESTABREJECTED,		/* PAWSEstabRejected */
+	LINUX_MIB_BEYOND_WINDOW,		/* BeyondWindow */
 	LINUX_MIB_TSECRREJECTED,		/* TSEcrRejected */
 	LINUX_MIB_PAWS_OLD_ACK,			/* PAWSOldAck */
 	LINUX_MIB_PAWS_TW_REJECTED,		/* PAWSTimewait */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index ea2f01584379a59a0a01226ae0f45d3614733fef..65b0d0ab0084029db43135a91da6eeb1f1fba024 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -189,6 +189,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TWKilled", LINUX_MIB_TIMEWAITKILLED),
 	SNMP_MIB_ITEM("PAWSActive", LINUX_MIB_PAWSACTIVEREJECTED),
 	SNMP_MIB_ITEM("PAWSEstab", LINUX_MIB_PAWSESTABREJECTED),
+	SNMP_MIB_ITEM("BeyondWindow", LINUX_MIB_BEYOND_WINDOW),
 	SNMP_MIB_ITEM("TSEcrRejected", LINUX_MIB_TSECRREJECTED),
 	SNMP_MIB_ITEM("PAWSOldAck", LINUX_MIB_PAWS_OLD_ACK),
 	SNMP_MIB_ITEM("PAWSTimewait", LINUX_MIB_PAWS_TW_REJECTED),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f0f9c78654b449cb2a122e8c53fdcc96e5317de7..5e2d82c273e2fc914706651a660464db4aba8504 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5900,6 +5900,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		if (!th->rst) {
 			if (th->syn)
 				goto syn_challenge;
+			NET_INC_STATS(sock_net(sk), LINUX_MIB_BEYOND_WINDOW);
 			if (!tcp_oow_rate_limited(sock_net(sk), skb,
 						  LINUX_MIB_TCPACKSKIPPEDSEQ,
 						  &tp->last_oow_ack_time))
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 3/8] selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
  2025-07-11 11:39 ` [PATCH net-next 1/8] tcp: do not accept packets beyond window Eric Dumazet
  2025-07-11 11:40 ` [PATCH net-next 2/8] tcp: add LINUX_MIB_BEYOND_WINDOW Eric Dumazet
@ 2025-07-11 11:40 ` Eric Dumazet
  2025-07-12 20:58   ` Kuniyuki Iwashima
  2025-07-11 11:40 ` [PATCH net-next 4/8] tcp: call tcp_measure_rcv_mss() for ooo packets Eric Dumazet
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

This test checks TCP behavior when receiving a packet beyond the window.

It checks the new TcpExtBeyondWindow SNMP counter.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 .../net/packetdrill/tcp_rcv_big_endseq.pkt    | 44 +++++++++++++++++++
 1 file changed, 44 insertions(+)
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
new file mode 100644
index 0000000000000000000000000000000000000000..7e170b94fd366ef516d68cf97bf921fdbf437ca8
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+
+--mss=1000
+
+`./defaults.sh`
+
+    0 `nstat -n`
+
+// Establish a connection.
+   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [10000], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+   +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 0>
+  +.1 < . 1:1(0) ack 1 win 257
+
+  +0 accept(3, ..., ...) = 4
+
+  +0 < P. 1:4001(4000) ack 1 win 257
+  +0 > .  1:1(0) ack 4001 win 5000
+
+// packet in sequence : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE / LINUX_MIB_BEYOND_WINDOW
+  +0 < P. 4001:54001(50000) ack 1 win 257
+  +0 > .  1:1(0) ack 4001 win 5000
+
+// ooo packet. : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE / LINUX_MIB_BEYOND_WINDOW
+  +1 < P. 5001:55001(50000) ack 1 win 257
+  +0 > .  1:1(0) ack 4001 win 5000
+
+// SKB_DROP_REASON_TCP_INVALID_SEQUENCE / LINUX_MIB_BEYOND_WINDOW
+  +0 < P. 70001:80001(10000) ack 1 win 257
+  +0 > .  1:1(0) ack 4001 win 5000
+
+  +0 read(4, ..., 100000) = 4000
+
+// If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
+  +0 < P. 4001:54001(50000) ack 1 win 257
+  +.040 > .  1:1(0) ack 54001 win 0
+
+// Check LINUX_MIB_BEYOND_WINDOW has been incremented 3 times.
++0 `nstat | grep TcpExtBeyondWindow | grep -q " 3 "`
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 4/8] tcp: call tcp_measure_rcv_mss() for ooo packets
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-07-11 11:40 ` [PATCH net-next 3/8] selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt Eric Dumazet
@ 2025-07-11 11:40 ` Eric Dumazet
  2025-07-12 21:11   ` Kuniyuki Iwashima
  2025-07-11 11:40 ` [PATCH net-next 5/8] selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt Eric Dumazet
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

tcp_measure_rcv_mss() is used to update icsk->icsk_ack.rcv_mss
(tcpi_rcv_mss in tcp_info) and tp->scaling_ratio.

Calling it from tcp_data_queue_ofo() makes sure these
fields are updated, and permits a better tuning
of sk->sk_rcvbuf, in the case a new flow receives many ooo
packets.
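
To illustrate why a stale scaling_ratio matters for sk_rcvbuf tuning, here
is a user-space sketch of the arithmetic. It assumes that the ratio is an
8-bit fixed-point fraction of payload length over skb truesize, and that
the usable window is the buffer space multiplied by that fraction. The
function names, the RMEM_TO_WIN_SCALE macro and all truesize values are
made up for the example and are not taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the ratio is stored with 8 fractional bits. */
#define RMEM_TO_WIN_SCALE 8

/* Ratio of payload bytes to memory actually consumed by the skb.
 * truesize is assumed to be non-zero.
 */
unsigned int scaling_ratio_from_skb(unsigned int len, unsigned int truesize)
{
	uint64_t val = ((uint64_t)len << RMEM_TO_WIN_SCALE) / truesize;

	/* Never let the ratio collapse to zero. */
	return val ? (unsigned int)val : 1;
}

/* Window that can be advertised for a given amount of buffer space. */
int win_from_space(unsigned int ratio, int space)
{
	return (int)(((int64_t)space * ratio) >> RMEM_TO_WIN_SCALE);
}

void scaling_ratio_examples(void)
{
	/* 1000 payload bytes in a 2000-byte skb: half the memory is payload. */
	assert(scaling_ratio_from_skb(1000, 2000) == 128);
	assert(win_from_space(128, 131072) == 65536);

	/* 9000 payload bytes in a 12000-byte skb: a denser packet lets the
	 * same buffer back a larger window.
	 */
	assert(scaling_ratio_from_skb(9000, 12000) == 192);
	assert(win_from_space(192, 131072) == 98304);
}
```

If the ratio is never refreshed because only ooo packets have arrived, the
window keeps being derived from the initial estimate rather than from what
the flow actually delivers.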

Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5e2d82c273e2fc914706651a660464db4aba8504..78da05933078b5b665113b57a0edc03b29820496 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4923,6 +4923,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		return;
 	}
 
+	tcp_measure_rcv_mss(sk, skb);
 	/* Disable header prediction. */
 	tp->pred_flags = 0;
 	inet_csk_schedule_ack(sk);
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 5/8] selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
                   ` (3 preceding siblings ...)
  2025-07-11 11:40 ` [PATCH net-next 4/8] tcp: call tcp_measure_rcv_mss() for ooo packets Eric Dumazet
@ 2025-07-11 11:40 ` Eric Dumazet
  2025-07-12 21:42   ` Kuniyuki Iwashima
  2025-07-11 11:40 ` [PATCH net-next 6/8] tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb Eric Dumazet
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

We make sure tcpi_rcv_mss and tp->scaling_ratio
are correctly updated if no in-order packet has been received yet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 .../net/packetdrill/tcp_ooo_rcv_mss.pkt       | 27 +++++++++++++++++++
 1 file changed, 27 insertions(+)
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_ooo_rcv_mss.pkt

diff --git a/tools/testing/selftests/net/packetdrill/tcp_ooo_rcv_mss.pkt b/tools/testing/selftests/net/packetdrill/tcp_ooo_rcv_mss.pkt
new file mode 100644
index 0000000000000000000000000000000000000000..7e6bc5fb0c8d78f36dc3d18842ff11d938c4e41b
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_ooo_rcv_mss.pkt
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+
+--mss=1000
+
+`./defaults.sh
+sysctl -q net.ipv4.tcp_rmem="4096 131072 $((32*1024*1024))"`
+
+   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 7>
+   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 10>
+  +.1 < . 1:1(0) ack 1 win 257
+
+   +0 accept(3, ..., ...) = 4
+
+   +0 < . 2001:11001(9000) ack 1 win 257
+   +0 > . 1:1(0) ack 1 win 81 <nop,nop,sack 2001:11001>
+
+// check that ooo packet properly updates tcpi_rcv_mss
+   +0 %{ assert tcpi_rcv_mss == 1000, tcpi_rcv_mss }%
+
+   +0 < . 11001:21001(10000) ack 1 win 257
+   +0 > . 1:1(0) ack 1 win 81 <nop,nop,sack 2001:21001>
+
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 6/8] tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
                   ` (4 preceding siblings ...)
  2025-07-11 11:40 ` [PATCH net-next 5/8] selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt Eric Dumazet
@ 2025-07-11 11:40 ` Eric Dumazet
  2025-07-12 21:43   ` Kuniyuki Iwashima
  2025-07-11 11:40 ` [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks Eric Dumazet
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

These functions do not modify the skb; add a const qualifier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sock.h   | 2 +-
 net/ipv4/tcp_input.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0f2443d4ec581639eb3bdc46cb9b2932123e9246..c8a4b283df6fc4b931270502ddbb5df7ae1e4aa2 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1553,7 +1553,7 @@ __sk_rmem_schedule(struct sock *sk, int size, bool pfmemalloc)
 }
 
 static inline bool
-sk_rmem_schedule(struct sock *sk, struct sk_buff *skb, int size)
+sk_rmem_schedule(struct sock *sk, const struct sk_buff *skb, int size)
 {
 	return __sk_rmem_schedule(sk, size, skb_pfmemalloc(skb));
 }
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 78da05933078b5b665113b57a0edc03b29820496..39de55ff898e6ec9c6e5bc9dc7b80ec9d235ca44 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4888,7 +4888,7 @@ static void tcp_ofo_queue(struct sock *sk)
 static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb);
 static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb);
 
-static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
+static int tcp_try_rmem_schedule(struct sock *sk, const struct sk_buff *skb,
 				 unsigned int size)
 {
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
                   ` (5 preceding siblings ...)
  2025-07-11 11:40 ` [PATCH net-next 6/8] tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb Eric Dumazet
@ 2025-07-11 11:40 ` Eric Dumazet
  2025-07-12 21:54   ` Kuniyuki Iwashima
                     ` (2 more replies)
  2025-07-11 11:40 ` [PATCH net-next 8/8] selftests/net: packetdrill: add tcp_rcv_toobig.pkt Eric Dumazet
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Currently, the TCP stack accepts an incoming packet if the sizes of the
receive queues are below the sk->sk_rcvbuf limit.

This can cause memory overshoot if the packet is big, like a 1/2 MB
BIG TCP one.

Refine the check to take into account the incoming skb truesize.

Note that we still accept the packet if the receive queue is empty,
to not completely freeze TCP flows in pathological conditions.
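
The difference between the old and the new test can be shown with a small
user-space sketch. struct rcv_account and both function names are
hypothetical; the fields mirror sk_rmem_alloc and sk_rcvbuf, and the sizes
are invented for the example. The empty-queue exception mentioned above is
deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two socket fields involved. */
struct rcv_account {
	unsigned int rmem_alloc;	/* memory already charged to the queues */
	unsigned int rcvbuf;		/* configured limit */
};

/* Old behavior: only the memory already charged is compared to the limit. */
bool old_check_allows(const struct rcv_account *a)
{
	return a->rmem_alloc <= a->rcvbuf;
}

/* New behavior: the incoming skb truesize is part of the comparison. */
bool can_ingest(const struct rcv_account *a, unsigned int truesize)
{
	return a->rmem_alloc + truesize <= a->rcvbuf;
}

void rcvbuf_check_examples(void)
{
	struct rcv_account a = {
		.rmem_alloc = 100 * 1024,
		.rcvbuf = 128 * 1024,
	};

	/* A 512 KB skb passes the old test and overshoots the limit... */
	assert(old_check_allows(&a));
	/* ...but is refused once its truesize is taken into account. */
	assert(!can_ingest(&a, 512 * 1024));
	/* A packet that still fits is accepted by both. */
	assert(can_ingest(&a, 16 * 1024));
}
```

With the old test the queues could end up holding 612 KB against a 128 KB
limit in this example, which is the overshoot the patch is meant to avoid.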

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 39de55ff898e6ec9c6e5bc9dc7b80ec9d235ca44..9c5baace4b7b24140ab5e0eafc397f124c8c64dd 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4888,10 +4888,20 @@ static void tcp_ofo_queue(struct sock *sk)
 static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb);
 static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb);
 
+/* Check if this incoming skb can be added to socket receive queues
+ * while satisfying sk->sk_rcvbuf limit.
+ */
+static bool tcp_can_ingest(const struct sock *sk, const struct sk_buff *skb)
+{
+	unsigned int new_mem = atomic_read(&sk->sk_rmem_alloc) + skb->truesize;
+
+	return new_mem <= sk->sk_rcvbuf;
+}
+
 static int tcp_try_rmem_schedule(struct sock *sk, const struct sk_buff *skb,
 				 unsigned int size)
 {
-	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
+	if (!tcp_can_ingest(sk, skb) ||
 	    !sk_rmem_schedule(sk, skb, size)) {
 
 		if (tcp_prune_queue(sk, skb) < 0)
@@ -5507,7 +5517,7 @@ static bool tcp_prune_ofo_queue(struct sock *sk, const struct sk_buff *in_skb)
 		tcp_drop_reason(sk, skb, SKB_DROP_REASON_TCP_OFO_QUEUE_PRUNE);
 		tp->ooo_last_skb = rb_to_skb(prev);
 		if (!prev || goal <= 0) {
-			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+			if (tcp_can_ingest(sk, skb) &&
 			    !tcp_under_memory_pressure(sk))
 				break;
 			goal = sk->sk_rcvbuf >> 3;
@@ -5541,12 +5551,12 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 
 	NET_INC_STATS(sock_net(sk), LINUX_MIB_PRUNECALLED);
 
-	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
+	if (!tcp_can_ingest(sk, in_skb))
 		tcp_clamp_window(sk);
 	else if (tcp_under_memory_pressure(sk))
 		tcp_adjust_rcv_ssthresh(sk);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+	if (tcp_can_ingest(sk, in_skb))
 		return 0;
 
 	tcp_collapse_ofo_queue(sk);
@@ -5556,7 +5566,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 			     NULL,
 			     tp->copied_seq, tp->rcv_nxt);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+	if (tcp_can_ingest(sk, in_skb))
 		return 0;
 
 	/* Collapsing did not help, destructive actions follow.
@@ -5564,7 +5574,7 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 
 	tcp_prune_ofo_queue(sk, in_skb);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+	if (tcp_can_ingest(sk, in_skb))
 		return 0;
 
 	/* If we are really being abused, tell the caller to silently
-- 
2.50.0.727.gbf7dc18ff4-goog



* [PATCH net-next 8/8] selftests/net: packetdrill: add tcp_rcv_toobig.pkt
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
                   ` (6 preceding siblings ...)
  2025-07-11 11:40 ` [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks Eric Dumazet
@ 2025-07-11 11:40 ` Eric Dumazet
  2025-07-12 21:57   ` Kuniyuki Iwashima
  2025-07-15  2:20 ` [PATCH net-next 0/8] tcp: receiver changes patchwork-bot+netdevbpf
  2025-07-15  8:25 ` Paolo Abeni
  9 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-07-11 11:40 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Check TCP receiver behavior after "tcp: stronger sk_rcvbuf checks".

A too fat packet is dropped unless the receive queue is empty.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 .../net/packetdrill/tcp_rcv_toobig.pkt        | 33 +++++++++++++++++++
 1 file changed, 33 insertions(+)
 create mode 100644 tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
new file mode 100644
index 0000000000000000000000000000000000000000..f575c0ff89da3c856208b315358c1c4a4c331d12
--- /dev/null
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_toobig.pkt
@@ -0,0 +1,33 @@
+// SPDX-License-Identifier: GPL-2.0
+
+--mss=1000
+
+`./defaults.sh`
+
+    0 `nstat -n`
+
+// Establish a connection.
+   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [20000], 4) = 0
+   +0 bind(3, ..., ...) = 0
+   +0 listen(3, 1) = 0
+
+   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+   +0 > S. 0:0(0) ack 1 win 18980 <mss 1460,nop,wscale 0>
+  +.1 < . 1:1(0) ack 1 win 257
+
+   +0 accept(3, ..., ...) = 4
+
+   +0 < P. 1:20001(20000) ack 1 win 257
+ +.04 > .  1:1(0) ack 20001 win 18000
+
+   +0 setsockopt(4, SOL_SOCKET, SO_RCVBUF, [12000], 4) = 0
+   +0 < P. 20001:80001(60000) ack 1 win 257
+   +0 > .  1:1(0) ack 20001 win 18000
+
+   +0 read(4, ..., 20000) = 20000
+// A too big packet is accepted if the receive queue is empty
+   +0 < P. 20001:80001(60000) ack 1 win 257
+   +0 > .  1:1(0) ack 80001 win 0
+
-- 
2.50.0.727.gbf7dc18ff4-goog



* Re: [PATCH net-next 1/8] tcp: do not accept packets beyond window
  2025-07-11 11:39 ` [PATCH net-next 1/8] tcp: do not accept packets beyond window Eric Dumazet
@ 2025-07-12 20:52   ` Kuniyuki Iwashima
  2025-07-15  1:38   ` Jakub Kicinski
  1 sibling, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 20:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Currently, TCP accepts incoming packets which might go beyond the
> offered RWIN.
>
> Add to tcp_sequence() the validation of packet end sequence.
>
> Add the corresponding check in the fast path.
>
> We relax this new constraint if the receive queue is empty,
> to not freeze flows from buggy peers.
>
> Add a new drop reason : SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 2/8] tcp: add LINUX_MIB_BEYOND_WINDOW
  2025-07-11 11:40 ` [PATCH net-next 2/8] tcp: add LINUX_MIB_BEYOND_WINDOW Eric Dumazet
@ 2025-07-12 20:55   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 20:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Add a new SNMP MIB : LINUX_MIB_BEYOND_WINDOW
>
> Incremented when an incoming packet is received beyond the
> receiver window.
>
> nstat -az | grep TcpExtBeyondWindow
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 3/8] selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt
  2025-07-11 11:40 ` [PATCH net-next 3/8] selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt Eric Dumazet
@ 2025-07-12 20:58   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 20:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> This test checks TCP behavior when receiving a packet beyond the window.
>
> It checks the new TcpExtBeyondWindow SNMP counter.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 4/8] tcp: call tcp_measure_rcv_mss() for ooo packets
  2025-07-11 11:40 ` [PATCH net-next 4/8] tcp: call tcp_measure_rcv_mss() for ooo packets Eric Dumazet
@ 2025-07-12 21:11   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 21:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> tcp_measure_rcv_mss() is used to update icsk->icsk_ack.rcv_mss
> (tcpi_rcv_mss in tcp_info) and tp->scaling_ratio.
>
> Calling it from tcp_data_queue_ofo() makes sure these
> fields are updated, and permits a better tuning
> of sk->sk_rcvbuf, in the case a new flow receives many ooo
> packets.
>
> Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>


* Re: [PATCH net-next 5/8] selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt
  2025-07-11 11:40 ` [PATCH net-next 5/8] selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt Eric Dumazet
@ 2025-07-12 21:42   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 21:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> We make sure tcpi_rcv_mss and tp->scaling_ratio
> are correctly updated if no in-order packet has been received yet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

* Re: [PATCH net-next 6/8] tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb
  2025-07-11 11:40 ` [PATCH net-next 6/8] tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb Eric Dumazet
@ 2025-07-12 21:43   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 21:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> These functions do not modify the skb, add a const qualifier.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-07-11 11:40 ` [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks Eric Dumazet
@ 2025-07-12 21:54   ` Kuniyuki Iwashima
  2025-12-15 10:19   ` Christian Ebner
  2026-01-25 21:11   ` [regression] " Simon Baatz
  2 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 21:54 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Currently, TCP stack accepts incoming packet if sizes of receive queues
> are below sk->sk_rcvbuf limit.
>
> This can cause memory overshoot if the packet is big, like a 1/2 MB
> BIG TCP one.
>
> Refine the check to take into account the incoming skb truesize.
>
> Note that we still accept the packet if the receive queue is empty,
> to not completely freeze TCP flows in pathological conditions.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
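
The change described in the quoted commit message can be sketched as
follows. This is a simplified user-space model under my own assumptions:
the struct, its field names and both function names are hypothetical and
only stand in for sk->sk_rcvbuf, the memory charged to the receive queues,
and the receive queue length. It is not the code from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a receiving socket. */
struct rx_sock_model {
	uint32_t rmem_alloc;	/* bytes currently charged to receive queues */
	uint32_t rcvbuf;	/* limit, analogous to sk->sk_rcvbuf */
	uint32_t rx_queue_len;	/* packets waiting in the receive queue */
};

/* Behaviour before the patch, as described: only current usage counts. */
static bool accept_old(const struct rx_sock_model *sk)
{
	return sk->rmem_alloc <= sk->rcvbuf;
}

/* Behaviour after the patch, as described: the incoming skb truesize is
 * added to the usage, except when the receive queue is empty. */
static bool accept_new(const struct rx_sock_model *sk, uint32_t truesize)
{
	if (sk->rx_queue_len == 0)
		return true;	/* do not completely freeze the flow */
	return (uint64_t)sk->rmem_alloc + truesize <= sk->rcvbuf;
}
```

In this model, a socket with 100000 bytes charged against a 131072-byte
limit would have taken a 512 KB packet under the old check, and refuses it
under the new one unless its receive queue is empty.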

* Re: [PATCH net-next 8/8] selftests/net: packetdrill: add tcp_rcv_toobig.pkt
  2025-07-11 11:40 ` [PATCH net-next 8/8] selftests/net: packetdrill: add tcp_rcv_toobig.pkt Eric Dumazet
@ 2025-07-12 21:57   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-12 21:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet

On Fri, Jul 11, 2025 at 4:40 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Check that TCP receiver behavior after "tcp: stronger sk_rcvbuf checks"
>
> Too fat packet is dropped unless receive queue is empty.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

* Re: [PATCH net-next 1/8] tcp: do not accept packets beyond window
  2025-07-11 11:39 ` [PATCH net-next 1/8] tcp: do not accept packets beyond window Eric Dumazet
  2025-07-12 20:52   ` Kuniyuki Iwashima
@ 2025-07-15  1:38   ` Jakub Kicinski
  1 sibling, 0 replies; 40+ messages in thread
From: Jakub Kicinski @ 2025-07-15  1:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Paolo Abeni, Neal Cardwell, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, 11 Jul 2025 11:39:59 +0000 Eric Dumazet wrote:
> -	/** @SKB_DROP_REASON_TCP_INVALID_SEQUENCE: Not acceptable SEQ field */
> +	/** @SKB_DROP_REASON_TCP_INVALID_SEQUENCE: Not acceptable SEQ field.
> +	 */
>  	SKB_DROP_REASON_TCP_INVALID_SEQUENCE,
> +	/** @SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE: Not acceptable END_SEQ field.
> +	 */
> +	SKB_DROP_REASON_TCP_INVALID_END_SEQUENCE,

FWIW this is not valid kdoc. We can either do:

	/** @WORDS: bla bla bla */

or

	/**
	 * @WORDS: bla bla bla
	 */

but "networking inspired style":

	/** @WORDS: bla bla bla
	 */

is not allowed.

Ima fix for you when applying.
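
For reference, the two accepted forms applied to an enum look like the
sketch below. The enum and its member names are made up for illustration
and are not the ones from the patch:

```c
/**
 * enum demo_drop_reason - hypothetical drop reasons, for illustration only
 */
enum demo_drop_reason {
	/** @DEMO_DROP_INVALID_SEQ: single-line form, opened and closed on one line */
	DEMO_DROP_INVALID_SEQ,
	/**
	 * @DEMO_DROP_INVALID_END_SEQ: multi-line form, with the opening marker
	 * on a line of its own so the description can wrap
	 */
	DEMO_DROP_INVALID_END_SEQ,
};
```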

* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
                   ` (7 preceding siblings ...)
  2025-07-11 11:40 ` [PATCH net-next 8/8] selftests/net: packetdrill: add tcp_rcv_toobig.pkt Eric Dumazet
@ 2025-07-15  2:20 ` patchwork-bot+netdevbpf
  2025-07-15  8:25 ` Paolo Abeni
  9 siblings, 0 replies; 40+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-07-15  2:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, ncardwell, horms, kuniyu, willemb, netdev,
	eric.dumazet

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 11 Jul 2025 11:39:58 +0000 you wrote:
> Before accepting an incoming packet:
> 
> - Make sure to not accept a packet beyond advertized RWIN.
>   If not, increment a new SNMP counter (LINUX_MIB_BEYOND_WINDOW)
> 
> - ooo packets should update rcv_mss and tp->scaling_ratio.
> 
> [...]

Here is the summary with links:
  - [net-next,1/8] tcp: do not accept packets beyond window
    https://git.kernel.org/netdev/net-next/c/9ca48d616ed7
  - [net-next,2/8] tcp: add LINUX_MIB_BEYOND_WINDOW
    https://git.kernel.org/netdev/net-next/c/6c758062c64d
  - [net-next,3/8] selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt
    https://git.kernel.org/netdev/net-next/c/f5fda1a86884
  - [net-next,4/8] tcp: call tcp_measure_rcv_mss() for ooo packets
    https://git.kernel.org/netdev/net-next/c/38d7e4443365
  - [net-next,5/8] selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt
    https://git.kernel.org/netdev/net-next/c/445e0cc38d49
  - [net-next,6/8] tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb
    https://git.kernel.org/netdev/net-next/c/75dff0584cce
  - [net-next,7/8] tcp: stronger sk_rcvbuf checks
    https://git.kernel.org/netdev/net-next/c/1d2fbaad7cd8
  - [net-next,8/8] selftests/net: packetdrill: add tcp_rcv_toobig.pkt
    https://git.kernel.org/netdev/net-next/c/906893cf2cf2

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
                   ` (8 preceding siblings ...)
  2025-07-15  2:20 ` [PATCH net-next 0/8] tcp: receiver changes patchwork-bot+netdevbpf
@ 2025-07-15  8:25 ` Paolo Abeni
  2025-07-15  9:21   ` Matthieu Baerts
  9 siblings, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-07-15  8:25 UTC (permalink / raw)
  To: Eric Dumazet, Neal Cardwell, Matthieu Baerts (NGI0)
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, David S . Miller, Jakub Kicinski

On 7/11/25 1:39 PM, Eric Dumazet wrote:
> Before accepting an incoming packet:
> 
> - Make sure to not accept a packet beyond advertized RWIN.
>   If not, increment a new SNMP counter (LINUX_MIB_BEYOND_WINDOW)
> 
> - ooo packets should update rcv_mss and tp->scaling_ratio.
> 
> - Make sure to not accept packet beyond sk_rcvbuf limit.
> 
> This series includes three associated packetdrill tests.

I suspect this series is causing pktdrill failures for the
tcp_rcv_big_endseq.pkt test case:

# selftests: net/packetdrill: tcp_rcv_big_endseq.pkt
# TAP version 13
# 1..2
# tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.347964 sec but happened at 1.307939 sec; tolerance 0.014000 sec
# script packet:  1.347964 . 1:1(0) ack 54001 win 0
# actual packet:  1.307939 . 1:1(0) ack 54001 win 0
# not ok 1 ipv4
# tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.354946 sec but happened at 1.314923 sec; tolerance 0.014000 sec
# script packet:  1.354946 . 1:1(0) ack 54001 win 0
# actual packet:  1.314923 . 1:1(0) ack 54001 win 0
# not ok 2 ipv6
# # Totals: pass:0 fail:2 xfail:0 xpass:0 skip:0 error:0

the event is happening _before_ the expected time, I guess it's more a
functional issue than a timing one.
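
To make the numbers explicit, here is a small sketch of the comparison,
with times converted to microseconds. It is a simplified model of a timing
check; the function names are hypothetical and packetdrill's real tolerance
handling is more elaborate than a single fixed bound:

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed distance between actual and expected time; negative = early. */
static int64_t timing_delta_usec(int64_t expected_usec, int64_t actual_usec)
{
	return actual_usec - expected_usec;
}

/* Simplified check: the event passes if it is within +/- tolerance. */
static bool timing_ok(int64_t expected_usec, int64_t actual_usec,
		      int64_t tolerance_usec)
{
	int64_t delta = timing_delta_usec(expected_usec, actual_usec);

	if (delta < 0)
		delta = -delta;
	return delta <= tolerance_usec;
}
```

For the ipv4 run above, expected 1347964 us and actual 1307939 us give a
delta of -40025 us, i.e. the ACK left about 40 ms early, well outside the
14000 us tolerance.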

I also suspect this series is causing flakes in mptcp tests, i.e.:

# INFO: disconnect
# 63 ns1 MPTCP -> ns1 (10.0.1.1:20001      ) MPTCP     (duration 227ms) [ OK ]
# 64 ns1 MPTCP -> ns1 (10.0.1.1:20002      ) TCP       (duration 96ms) [ OK ]
# 65 ns1 TCP   -> ns1 (10.0.1.1:20003      ) MPTCP     copyfd_io_poll: poll timed out (events: POLLIN 0, POLLOUT 4)
# copyfd_io_poll: poll timed out (events: POLLIN 1, POLLOUT 0)
# (duration 30318ms) [FAIL] client exit code 2, server 0
#
# netns ns1-VslcTV (listener) socket stat for 20003:
# Netid State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
# tcp   FIN-WAIT-2 0      0           10.0.1.1:20003     10.0.1.1:60698 timer:(timewait,59sec,0) ino:0 sk:1012
# tcp   TIME-WAIT  0      0           10.0.1.1:20003     10.0.1.1:60696 timer:(timewait,29sec,0) ino:0 sk:1013
#
# TcpActiveOpens                  3                  0.0
# TcpPassiveOpens                 3                  0.0
# TcpInSegs                       1472               0.0
# TcpOutSegs                      1471               0.0
# TcpRetransSegs                  3                  0.0
# TcpExtPruneCalled               4                  0.0
# TcpExtRcvPruned                 3                  0.0
# TcpExtTW                        3                  0.0
# TcpExtBeyondWindow              7                  0.0
# TcpExtTCPHPHits                 34                 0.0
# TcpExtTCPPureAcks               386                0.0
# TcpExtTCPHPAcks                 33                 0.0
# TcpExtTCPSackRecovery           1                  0.0
# TcpExtTCPFastRetrans            1                  0.0
# TcpExtTCPLossProbes             2                  0.0
# TcpExtTCPLossProbeRecovery      1                  0.0
# TcpExtTCPRcvCollapsed           3                  0.0
# TcpExtTCPBacklogCoalesce        261                0.0
# TcpExtTCPSackShiftFallback      1                  0.0
# TcpExtTCPRcvCoalesce            500                0.0
# TcpExtTCPOFOQueue               1                  0.0
# TcpExtTCPFromZeroWindowAdv      60                 0.0
# TcpExtTCPToZeroWindowAdv        58                 0.0
# TcpExtTCPWantZeroWindowAdv      296                0.0
# TcpExtTCPOrigDataSent           1038               0.0
# TcpExtTCPHystartTrainDetect     1                  0.0
# TcpExtTCPHystartTrainCwnd       16                 0.0
# TcpExtTCPACKSkippedSeq          1                  0.0
# TcpExtTCPWinProbe               7                  0.0
# TcpExtTCPDelivered              1041               0.0
# TcpExtTCPRcvQDrop               2                  0.0
#
# netns ns1-VslcTV (connector) socket stat for 20003:
# Failed to find cgroup2 mount
# Failed to find cgroup2 mount
# Netid State     Recv-Q Send-Q  Local Address:Port  Peer Address:Port

# tcp   TIME-WAIT 0      0            10.0.1.1:60684     10.0.1.1:20003
timer:(timewait,29sec,0) ino:0 sk:11
#
# tcp   LAST-ACK  0      1735147      10.0.1.1:60698     10.0.1.1:20003
timer:(persist,22sec,0) ino:0 sk:12 cgroup:unreachable:1 ---
#  skmem:(r0,rb361100,t0,tb2626560,f2838,w1758442,o0,bl0,d61) ts sack
cubic wscale:7,7 rto:201 backoff:7 rtt:0.12/0.215 ato:40 mss:65483
pmtu:65535 rcvmss:65483 advmss:65483 cwnd:7 ssthresh:7
bytes_sent:1738187 bytes_retrans:65461 bytes_acked:1672727
bytes_received:7659224 segs_out:180 segs_in:243 data_segs_out:103
data_segs_in:221 send 30558733333bps lastsnd:30125 lastrcv:30322
lastack:3693 pacing_rate 36480477512bps delivery_rate 196449000000bps
delivered:103 app_limited busy:30351ms rwnd_limited:30350ms(100.0%)
retrans:0/1 rcv_rtt:0.005 rcv_space:289974 rcv_ssthresh:324480
notsent:1735147 minrtt:0.001 rcv_wnd:324480

@Matttbe: can you reproduce the flakes locally? if so, does reverting
that series stop them? (not that I'm planning a revert, just to validate
my guess).

Thanks,

Paolo


* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15  8:25 ` Paolo Abeni
@ 2025-07-15  9:21   ` Matthieu Baerts
  2025-07-15 10:14     ` Paolo Abeni
  0 siblings, 1 reply; 40+ messages in thread
From: Matthieu Baerts @ 2025-07-15  9:21 UTC (permalink / raw)
  To: Paolo Abeni, Eric Dumazet, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, David S . Miller, Jakub Kicinski

Hi Paolo,

Thank you for having CCed me!

On 15/07/2025 10:25, Paolo Abeni wrote:
> On 7/11/25 1:39 PM, Eric Dumazet wrote:
>> Before accepting an incoming packet:
>>
>> - Make sure to not accept a packet beyond advertized RWIN.
>>   If not, increment a new SNMP counter (LINUX_MIB_BEYOND_WINDOW)
>>
>> - ooo packets should update rcv_mss and tp->scaling_ratio.
>>
>> - Make sure to not accept packet beyond sk_rcvbuf limit.
>>
>> This series includes three associated packetdrill tests.
> 
> I suspect this series is causing pktdrill failures for the
> tcp_rcv_big_endseq.pkt test case:

(Note that this series introduces this new pktdrill test)

(...)
> the event is happening _before_ the expected time, I guess it's more a
> functional issue than a timing one.
> 
> I also suspect this series is causing flakes in mptcp tests, i.e.:

(...)

> @Matttbe: can you reproduce the flakes locally? if so, does reverting
> that series stop them? (not that I'm planning a revert, just to validate
> my guess).

I'm trying to reproduce this locally on top of net-next, no luck so far.
I will also continue to monitor the MPTCP CI.

For the moment, I don't think it is linked to this series: NIPA has been
validating it since the 11th, and the issues only appeared last night.
Plus, I recently added new MPTCP selftests running these tests in 3
additional modes. If this flake was present for a long time, it might be
more visible today.

Eventually, because the failure is due to a poll timed out, and other
unrelated tests have failed at that time too, could it be due to
overloaded test machines?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15  9:21   ` Matthieu Baerts
@ 2025-07-15 10:14     ` Paolo Abeni
  2025-07-15 10:40       ` Matthieu Baerts
  2025-07-15 13:28       ` Jakub Kicinski
  0 siblings, 2 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-07-15 10:14 UTC (permalink / raw)
  To: Matthieu Baerts, Eric Dumazet, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, David S . Miller, Jakub Kicinski

On 7/15/25 11:21 AM, Matthieu Baerts wrote:
> On 15/07/2025 10:25, Paolo Abeni wrote:
>> @Matttbe: can you reproduce the flakes locally? if so, does reverting
>> that series stop them? (not that I'm planning a revert, just to validate
>> my guess).
> 
> I'm trying to reproduce this locally on top of net-next, no luck so far.
> I will also continue to monitor the MPTCP CI.
> 
> For the moment, I don't think it is linked to this series:

Agreed. I did not notice the pending mptcp patches, which are a more
relevant suspect here.

> Eventually, because the failure is due to a poll timed out, and other
> unrelated tests have failed at that time too, could it be due to
> overloaded test machines?

Not for a 60s timeout, I guess :-P

/P



* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15 10:14     ` Paolo Abeni
@ 2025-07-15 10:40       ` Matthieu Baerts
  2025-07-15 13:28       ` Jakub Kicinski
  1 sibling, 0 replies; 40+ messages in thread
From: Matthieu Baerts @ 2025-07-15 10:40 UTC (permalink / raw)
  To: Paolo Abeni, Eric Dumazet, Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, David S . Miller, Jakub Kicinski

Hi Paolo,

On 15/07/2025 12:14, Paolo Abeni wrote:
> On 7/15/25 11:21 AM, Matthieu Baerts wrote:
>> On 15/07/2025 10:25, Paolo Abeni wrote:
>>> @Matttbe: can you reproduce the flakes locally? if so, does reverting
>>> that series stop them? (not that I'm planning a revert, just to validate
>>> my guess).
>>
>> I'm trying to reproduce this locally on top of net-next, no luck so far.
>> I will also continue to monitor the MPTCP CI.
>>
>> For the moment, I don't think it is linked to this series:
> 
> Agreed. I did not notice the pending mptcp patches, which are a more
> relevant suspect here.
> 
>> Eventually, because the failure is due to a poll timed out, and other
>> unrelated tests have failed at that time too, could it be due to
>> overloaded test machines?
> 
> Not for a 60s timeout, I guess :-P

:)

The poll timeout is set to 10s I think. But yes, it is still too long to
be caused by an overloaded test machine I suppose.

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15 10:14     ` Paolo Abeni
  2025-07-15 10:40       ` Matthieu Baerts
@ 2025-07-15 13:28       ` Jakub Kicinski
  2025-07-15 13:33         ` Jakub Kicinski
  2025-07-15 13:50         ` Paolo Abeni
  1 sibling, 2 replies; 40+ messages in thread
From: Jakub Kicinski @ 2025-07-15 13:28 UTC (permalink / raw)
  To: Paolo Abeni, Neal Cardwell
  Cc: Matthieu Baerts, Eric Dumazet, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet, David S . Miller

On Tue, 15 Jul 2025 12:14:34 +0200 Paolo Abeni wrote:
> > Eventually, because the failure is due to a poll timed out, and other
> > unrelated tests have failed at that time too, could it be due to
> > overloaded test machines?  
> 
> Not for a 60s timeout, I guess :-P

I think the timeout may be packetdrill-version related.
I tried with the Fedora packetdrill and the test times out.
With packetdrill built from source on my laptop I get:

# (null):17: error handling packet: timing error: expected outbound packet at 0.074144 sec but happened at -1752585909.757339 sec; tolerance 0.004000 sec
# script packet:  0.074144 S. 0:0(0) ack 1 <mss 1460,nop,wscale 0>
# actual packet: -1752585909.757339 S.0 0:0(0) ack 1 <mss 1460,nop,wscale 0>

:o

But the CI just gets the failure Paolo quoted.

I'm leaning towards Eric using a different packetdrill, and/or this
being packetdrill / compiler related. On Fedora I'm hitting this build
failure which may explain why the distro hasn't updated recently:

cc -g -Wall -Werror   -c -o code.o code.c
In file included from code.h:29,
                 from code.c:26:
types.h:64:12: error: two or more data types in declaration specifiers
   64 | typedef u8 bool;
      |            ^~~~
types.h:64:1: error: useless type name in empty declaration [-Werror]
   64 | typedef u8 bool;
      | ^~~~~~~
types.h:66:9: error: cannot use keyword ‘false’ as enumeration constant
   66 |         false = 0,
      |         ^~~~~
types.h:66:9: note: ‘false’ is a keyword with ‘-std=c23’ onwards
cc1: all warnings being treated as errors
make: *** [<builtin>: code.o] Error 1


Neal?
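
For context on the build log above: as the compiler note says, `false` is
a keyword from C23 onwards (and so are `bool` and `true`), so a header
that defines them itself no longer compiles once the compiler defaults to
a C23 dialect. One possible way around it, assuming the header only needs
an ordinary boolean type, is to rely on <stdbool.h> instead. The helper
below is hypothetical and this is not a proposed packetdrill patch:

```c
#include <stdbool.h>	/* bool/true/false for pre-C23; harmless in C23 */
#include <stdint.h>

typedef uint8_t u8;

/* Hypothetical helper showing that code mixing u8 and bool still builds
 * the same way under both older dialects and C23. */
static bool u8_to_bool(u8 v)
{
	return v != 0;
}
```

Another option, under the same caveat, would be to pin an older dialect
when building, for example with -std=gnu17.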

* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15 13:28       ` Jakub Kicinski
@ 2025-07-15 13:33         ` Jakub Kicinski
  2025-07-15 13:52           ` Paolo Abeni
  2025-07-15 14:48           ` Kuniyuki Iwashima
  2025-07-15 13:50         ` Paolo Abeni
  1 sibling, 2 replies; 40+ messages in thread
From: Jakub Kicinski @ 2025-07-15 13:33 UTC (permalink / raw)
  To: Paolo Abeni, Neal Cardwell
  Cc: Matthieu Baerts, Eric Dumazet, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet, David S . Miller

On Tue, 15 Jul 2025 06:28:29 -0700 Jakub Kicinski wrote:
> # (null):17: error handling packet: timing error: expected outbound packet at 0.074144 sec but happened at -1752585909.757339 sec; tolerance 0.004000 sec
> # script packet:  0.074144 S. 0:0(0) ack 1 <mss 1460,nop,wscale 0>
> # actual packet: -1752585909.757339 S.0 0:0(0) ack 1 <mss 1460,nop,wscale 0>

This is definitely compiler related, I rebuilt with clang and the build
error goes away. Now I get a more sane failure:

# tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.230105 sec but happened at 1.190101 sec; tolerance 0.005046 sec
# script packet:  1.230105 . 1:1(0) ack 54001 win 0 
# actual packet:  1.190101 . 1:1(0) ack 54001 win 0 

$ gcc --version
gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)

I don't understand why the ack is supposed to be delayed, should we
just do this? (I think Eric is OOO, FWIW)

diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
index 7e170b94fd36..3848b419e68c 100644
--- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
+++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
@@ -38,7 +38,7 @@
 
 // If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
   +0 < P. 4001:54001(50000) ack 1 win 257
-  +.040 > .  1:1(0) ack 54001 win 0
+  +0 > .  1:1(0) ack 54001 win 0
 
 // Check LINUX_MIB_BEYOND_WINDOW has been incremented 3 times.
 +0 `nstat | grep TcpExtBeyondWindow | grep -q " 3 "`
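
For readers following along, the rule in the script comment ("If queue is
empty, accept a packet even if its end_seq is above wup + rcv_wnd") can be
modelled as below. This is a simplified user-space sketch under my own
assumptions; the enum, the function names and the example numbers are
hypothetical and it is not the kernel's tcp_sequence():

```c
#include <stdbool.h>
#include <stdint.h>

enum seq_verdict {
	SEQ_ACCEPT,
	SEQ_DROP_OLD,
	SEQ_DROP_INVALID_SEQ,
	SEQ_DROP_INVALID_END_SEQ,	/* would bump a beyond-window counter */
};

/* Wraparound-safe 32-bit sequence comparisons. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

static bool seq_after(uint32_t a, uint32_t b)
{
	return seq_before(b, a);
}

/* Hypothetical model: wup is the left edge of the advertised window,
 * rcv_wnd its size, queue_len the number of queued packets. */
static enum seq_verdict check_window(uint32_t seq, uint32_t end_seq,
				     uint32_t wup, uint32_t rcv_wnd,
				     unsigned int queue_len)
{
	uint32_t right_edge = wup + rcv_wnd;

	if (seq_before(end_seq, wup))
		return SEQ_DROP_OLD;
	if (seq_after(end_seq, right_edge)) {
		if (seq_after(seq, right_edge))
			return SEQ_DROP_INVALID_SEQ;
		/* Starts in window, ends beyond it: only fine if queue is empty. */
		if (queue_len)
			return SEQ_DROP_INVALID_END_SEQ;
	}
	return SEQ_ACCEPT;
}
```

With made-up values wup = 1 and rcv_wnd = 4000, the 4001:54001 segment
from the script is accepted when the queue is empty and rejected as
"invalid end sequence" when something is already queued.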

* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15 13:28       ` Jakub Kicinski
  2025-07-15 13:33         ` Jakub Kicinski
@ 2025-07-15 13:50         ` Paolo Abeni
  1 sibling, 0 replies; 40+ messages in thread
From: Paolo Abeni @ 2025-07-15 13:50 UTC (permalink / raw)
  To: Jakub Kicinski, Neal Cardwell
  Cc: Matthieu Baerts, Eric Dumazet, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet, David S . Miller

On 7/15/25 3:28 PM, Jakub Kicinski wrote:
> On Tue, 15 Jul 2025 12:14:34 +0200 Paolo Abeni wrote:
>>> Eventually, because the failure is due to a poll timed out, and other
>>> unrelated tests have failed at that time too, could it be due to
>>> overloaded test machines?  
>>
>> Not for a 60s timeout, I guess :-P

FTR, the above referred to the mptcp selftest failure/timeout.

/P


* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15 13:33         ` Jakub Kicinski
@ 2025-07-15 13:52           ` Paolo Abeni
  2025-07-15 14:54             ` Jakub Kicinski
  2025-07-15 14:48           ` Kuniyuki Iwashima
  1 sibling, 1 reply; 40+ messages in thread
From: Paolo Abeni @ 2025-07-15 13:52 UTC (permalink / raw)
  To: Jakub Kicinski, Neal Cardwell
  Cc: Matthieu Baerts, Eric Dumazet, Simon Horman, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet, David S . Miller

On 7/15/25 3:33 PM, Jakub Kicinski wrote:
> On Tue, 15 Jul 2025 06:28:29 -0700 Jakub Kicinski wrote:
>> # (null):17: error handling packet: timing error: expected outbound packet at 0.074144 sec but happened at -1752585909.757339 sec; tolerance 0.004000 sec
>> # script packet:  0.074144 S. 0:0(0) ack 1 <mss 1460,nop,wscale 0>
>> # actual packet: -1752585909.757339 S.0 0:0(0) ack 1 <mss 1460,nop,wscale 0>
> 
> This is definitely compiler related, I rebuilt with clang and the build
> error goes away. Now I get a more sane failure:
> 
> # tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.230105 sec but happened at 1.190101 sec; tolerance 0.005046 sec
> # script packet:  1.230105 . 1:1(0) ack 54001 win 0 
> # actual packet:  1.190101 . 1:1(0) ack 54001 win 0 
> 
> $ gcc --version
> gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
> 
> I don't understand why the ack is supposed to be delayed, should we
> just do this? (I think Eric is OOO, FWIW)
> 
> diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> index 7e170b94fd36..3848b419e68c 100644
> --- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> +++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> @@ -38,7 +38,7 @@
>  
>  // If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
>    +0 < P. 4001:54001(50000) ack 1 win 257
> -  +.040 > .  1:1(0) ack 54001 win 0
> +  +0 > .  1:1(0) ack 54001 win 0
>  
>  // Check LINUX_MIB_BEYOND_WINDOW has been incremented 3 times.
>  +0 `nstat | grep TcpExtBeyondWindow | grep -q " 3 "`

The above looks sane to me, but a Neal or Willem ack would be appreciated.

/P





* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15 13:33         ` Jakub Kicinski
  2025-07-15 13:52           ` Paolo Abeni
@ 2025-07-15 14:48           ` Kuniyuki Iwashima
  1 sibling, 0 replies; 40+ messages in thread
From: Kuniyuki Iwashima @ 2025-07-15 14:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Paolo Abeni, Neal Cardwell, Matthieu Baerts, Eric Dumazet,
	Simon Horman, Willem de Bruijn, netdev, eric.dumazet,
	David S . Miller

On Tue, Jul 15, 2025 at 6:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 15 Jul 2025 06:28:29 -0700 Jakub Kicinski wrote:
> > # (null):17: error handling packet: timing error: expected outbound packet at 0.074144 sec but happened at -1752585909.757339 sec; tolerance 0.004000 sec
> > # script packet:  0.074144 S. 0:0(0) ack 1 <mss 1460,nop,wscale 0>
> > # actual packet: -1752585909.757339 S.0 0:0(0) ack 1 <mss 1460,nop,wscale 0>
>
> This is definitely compiler related, I rebuilt with clang and the build
> error goes away. Now I get a more sane failure:
>
> # tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.230105 sec but happened at 1.190101 sec; tolerance 0.005046 sec
> # script packet:  1.230105 . 1:1(0) ack 54001 win 0
> # actual packet:  1.190101 . 1:1(0) ack 54001 win 0
>
> $ gcc --version
> gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
>
> I don't understand why the ack is supposed to be delayed, should we
> just do this? (I think Eric is OOO, FWIW)
>
> diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> index 7e170b94fd36..3848b419e68c 100644
> --- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> +++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> @@ -38,7 +38,7 @@
>
>  // If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
>    +0 < P. 4001:54001(50000) ack 1 win 257
> -  +.040 > .  1:1(0) ack 54001 win 0
> +  +0 > .  1:1(0) ack 54001 win 0
>
>  // Check LINUX_MIB_BEYOND_WINDOW has been incremented 3 times.
>  +0 `nstat | grep TcpExtBeyondWindow | grep -q " 3 "`

I remember I didn't see this error just after the commit that added the test,
and now I see the failure after commit 1d2fbaad7cd8c ("tcp: stronger
sk_rcvbuf checks").

[root@fedora packetdrill]# uname -r
6.16.0-rc5-01431-g75dff0584cce
[root@fedora packetdrill]# ./ksft_runner.sh tcp_rcv_big_endseq.pkt
TAP version 13
1..2
ok 1 ipv4
ok 2 ipv6
# Totals: pass:2 fail:0 xfail:0 xpass:0 skip:0 error:0

[root@fedora packetdrill]# uname -r
6.16.0-rc5-01432-g1d2fbaad7cd8
[root@fedora packetdrill]# ./ksft_runner.sh tcp_rcv_big_endseq.pkt
TAP version 13
1..2
tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.148682 sec but happened at 1.108681 sec; tolerance 0.005005 sec
script packet:  1.148682 . 1:1(0) ack 54001 win 0
actual packet:  1.108681 . 1:1(0) ack 54001 win 0
not ok 1 ipv4
tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.146130 sec but happened at 1.106130 sec; tolerance 0.005005 sec
script packet:  1.146130 . 1:1(0) ack 54001 win 0
actual packet:  1.106130 . 1:1(0) ack 54001 win 0
not ok 2 ipv6
# Totals: pass:0 fail:2 xfail:0 xpass:0 skip:0 error:0


On 75dff0584cce, the test failed if I removed the delay.
I haven't checked where it comes from, but probably that's
why Eric added the delay ?

[root@fedora packetdrill]# ./ksft_runner.sh tcp_rcv_big_endseq.pkt
TAP version 13
1..2
tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.105941 sec but happened at 1.146774 sec; tolerance 0.004000 sec
script packet:  1.105941 . 1:1(0) ack 54001 win 0
actual packet:  1.146774 . 1:1(0) ack 54001 win 0
not ok 1 ipv4
tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.106215 sec but happened at 1.146815 sec; tolerance 0.004000 sec
script packet:  1.106215 . 1:1(0) ack 54001 win 0
actual packet:  1.146815 . 1:1(0) ack 54001 win 0

not ok 2 ipv6
# Totals: pass:0 fail:2 xfail:0 xpass:0 skip:0 error:0

* Re: [PATCH net-next 0/8] tcp: receiver changes
  2025-07-15 13:52           ` Paolo Abeni
@ 2025-07-15 14:54             ` Jakub Kicinski
  0 siblings, 0 replies; 40+ messages in thread
From: Jakub Kicinski @ 2025-07-15 14:54 UTC (permalink / raw)
  To: Paolo Abeni
  Cc: Neal Cardwell, Matthieu Baerts, Eric Dumazet, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	David S . Miller

On Tue, 15 Jul 2025 15:52:33 +0200 Paolo Abeni wrote:
> On 7/15/25 3:33 PM, Jakub Kicinski wrote:
> > On Tue, 15 Jul 2025 06:28:29 -0700 Jakub Kicinski wrote:  
> >> # (null):17: error handling packet: timing error: expected outbound packet at 0.074144 sec but happened at -1752585909.757339 sec; tolerance 0.004000 sec
> >> # script packet:  0.074144 S. 0:0(0) ack 1 <mss 1460,nop,wscale 0>
> >> # actual packet: -1752585909.757339 S.0 0:0(0) ack 1 <mss 1460,nop,wscale 0>  
> > 
> > This is definitely compiler related, I rebuilt with clang and the build
> > error goes away. Now I get a more sane failure:
> > 
> > # tcp_rcv_big_endseq.pkt:41: error handling packet: timing error: expected outbound packet at 1.230105 sec but happened at 1.190101 sec; tolerance 0.005046 sec
> > # script packet:  1.230105 . 1:1(0) ack 54001 win 0 
> > # actual packet:  1.190101 . 1:1(0) ack 54001 win 0 
> > 
> > $ gcc --version
> > gcc (GCC) 15.1.1 20250521 (Red Hat 15.1.1-2)
> > 
> > I don't understand why the ack is supposed to be delayed, should we
> > just do this? (I think Eric is OOO, FWIW)
> > 
> > diff --git a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> > index 7e170b94fd36..3848b419e68c 100644
> > --- a/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> > +++ b/tools/testing/selftests/net/packetdrill/tcp_rcv_big_endseq.pkt
> > @@ -38,7 +38,7 @@
> >  
> >  // If queue is empty, accept a packet even if its end_seq is above wup + rcv_wnd
> >    +0 < P. 4001:54001(50000) ack 1 win 257
> > -  +.040 > .  1:1(0) ack 54001 win 0
> > +  +0 > .  1:1(0) ack 54001 win 0
> >  
> >  // Check LINUX_MIB_BEYOND_WINDOW has been incremented 3 times.
> >  +0 `nstat | grep TcpExtBeyondWindow | grep -q " 3 "`  
> 
> The above looks sane to me, but a Neal or Willem ack would be appreciated.

Posted officially here to get it queued to the CI already:
https://lore.kernel.org/all/20250715142849.959444-1-kuba@kernel.org/

* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-07-11 11:40 ` [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks Eric Dumazet
  2025-07-12 21:54   ` Kuniyuki Iwashima
@ 2025-12-15 10:19   ` Christian Ebner
  2025-12-18  9:31     ` Christian Ebner
  2026-01-25 21:11   ` [regression] " Simon Baatz
  2 siblings, 1 reply; 40+ messages in thread
From: Christian Ebner @ 2025-12-15 10:19 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet

Hi,

some of our users (Proxmox Backup Server) are seeing issues with slow
and stale backups on kernel versions 6.17 and 6.18, especially in
combination with MTU 9000. The issue persists also with the mainline
kernel v6.18. Backups are running over a single TCP connection using
HTTP/2 based protocol, the stall affects only the single TCP connection
while the rest of the network is unaffected. Also, other network
traffic does not reproduce the issue so far.

When reverting to older kernel versions the issue disappears [0].
Unfortunately the stale connections are not easily reproduced.

In an effort to identify the issue, bisection led us, and independently
an affected user [1], to this commit:

"1d2fbaad: tcp: stronger sk_rcvbuf checks"

Taking note that there were several patches with bugfixes and
additional adaptions, we are reaching out in order to ask for guidance
on how to best debug this issue further, given that it persists also
with the latest stable kernel.

What outputs could we provide to narrow down the possible root cause
of the stale TCP connections?

Output from `ss` and `nstat` gathered during 2 stale connections as
provided by an affected user [2]:
```
State  Recv-Q  Send-Q  Local Address:Port      Peer Address:Port
ESTAB  0       0       [::ffff:10.x.y.a]:8007  [::ffff:10.x.y.c]:48288
          cubic wscale:7,10 rto:207 rtt:6.582/11.374 ato:40 mss:8948 
pmtu:9000 rcvmss:3072 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:1084107 
bytes_retrans:123 bytes_acked:1083984 bytes_received:3703857790 
segs_out:317478 segs_in:315112 data_segs_out:2343 data_segs_in:314619 
send 109Mbps lastsnd:423225 lastrcv:76 lastack:76 pacing_rate 131Mbps 
delivery_rate 3.33Gbps delivered:2344 app_limited busy:3592ms 
retrans:0/1 dsack_dups:1 rcv_rtt:207.33 rcv_space:146392 
rcv_ssthresh:592739 minrtt:0.044 rcv_ooopack:890 snd_wnd:1065728 
rcv_wnd:3072
ESTAB  0       0       [::ffff:10.x.y.a]:8007  [::ffff:10.x.y.b]:46712
          cubic wscale:10,10 rto:201 rtt:0.333/0.496 ato:40 mss:8948 
pmtu:9000 rcvmss:4096 advmss:8948 cwnd:10 bytes_sent:861063 
bytes_acked:861063 bytes_received:181206715 segs_out:17834 segs_in:17552 
data_segs_out:382 data_segs_in:17280 send 2.15Gbps lastsnd:53439 
lastrcv:191 lastack:191 pacing_rate 4.29Gbps delivery_rate 2.95Gbps 
delivered:383 app_limited busy:405ms rcv_rtt:207.33 rcv_space:95745 
rcv_ssthresh:246825 minrtt:0.04 rcv_ooopack:75 snd_wnd:193536 rcv_wnd:4096
```

```
#kernel
IpInReceives                    18674              0.0
IpInDelivers                    18672              0.0
IpOutRequests                   21147              0.0
IpOutTransmits                  21147              0.0
TcpActiveOpens                  806                0.0
TcpPassiveOpens                 1052               0.0
TcpAttemptFails                 280                0.0
TcpInSegs                       18607              0.0
TcpOutSegs                      22190              0.0
TcpRetransSegs                  40                 0.0
TcpOutRsts                      280                0.0
UdpInDatagrams                  10                 0.0
UdpOutDatagrams                 31                 0.0
UdpIgnoredMulti                 17                 0.0
Ip6InReceives                   37                 0.0
Ip6InDiscards                   37                 0.0
Ip6InOctets                     2664               0.0
TcpExtTW                        526                0.0
TcpExtTWRecycled                2                  0.0
TcpExtDelayedACKs               19                 0.0
TcpExtDelayedACKLost            1                  0.0
TcpExtTCPHPHits                 1065               0.0
TcpExtTCPPureAcks               2901               0.0
TcpExtTCPHPAcks                 2010               0.0
TcpExtTCPSackRecovery           6                  0.0
TcpExtTCPSACKReorder            2                  0.0
TcpExtTCPLostRetransmit         2                  0.0
TcpExtTCPFastRetrans            40                 0.0
TcpExtTCPBacklogCoalesce        4                  0.0
TcpExtTCPDSACKOldSent           1                  0.0
TcpExtTCPSackShifted            5                  0.0
TcpExtTCPSackMerged             16                 0.0
TcpExtTCPSackShiftFallback      13                 0.0
TcpExtTCPRcvCoalesce            65                 0.0
TcpExtTCPAutoCorking            77                 0.0
TcpExtTCPFromZeroWindowAdv      2946               0.0
TcpExtTCPToZeroWindowAdv        248                0.0
TcpExtTCPOrigDataSent           8414               0.0
TcpExtTCPDelivered              7886               0.0
IpExtInMcastPkts                38                 0.0
IpExtInBcastPkts                17                 0.0
IpExtInOctets                   22147530           0.0
IpExtOutOctets                  16300295           0.0
IpExtInMcastOctets              1216               0.0
IpExtInBcastOctets              2755               0.0
IpExtInNoECTPkts                18663              0.0
IpExtInECT0Pkts                 11                 0.0
```

[0] https://forum.proxmox.com/threads/176444/
[1] https://forum.proxmox.com/threads/176444/post-824615
[2] https://forum.proxmox.com/threads/176444/post-824407

Best regards,
Christian Ebner



* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-15 10:19   ` Christian Ebner
@ 2025-12-18  9:31     ` Christian Ebner
  2025-12-18 10:10       ` Eric Dumazet
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Ebner @ 2025-12-18  9:31 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni,
	Neal Cardwell
  Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

Hi,
here is some more information we have gathered.

pcaps obtained via tcpdump of the traffic while in a stale state show 
the following recurring pattern:

41  0.705618  10.xx.xx.aa  10.xx.xx.bb  TCP  66    [TCP ZeroWindow] 8007 → 55554 [ACK] Seq=1 Ack=28673 Win=0 Len=0 TSval=2656874280 TSecr=1348075902
42  0.705662  10.xx.xx.aa  10.xx.xx.bb  TCP  66    [TCP Window Update] 8007 → 55554 [ACK] Seq=1 Ack=28673 Win=7 Len=0 TSval=2656874280 TSecr=1348075902
90  0.914606  10.xx.xx.bb  10.xx.xx.aa  TCP  7234  55554 → 8007 [PSH, ACK] Seq=28673 Ack=1 Win=139 Len=7168 TSval=1348076111 TSecr=2656874280

Output of `ss -tim` shows the sockets being severely limited in buffer size:

ESTAB  0  0  [::ffff:10.xx.xx.aa]:8007  [::ffff:10.xx.xx.bb]:55554
          skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic 
wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000 
rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478 
bytes_received:1295747055 segs_out:301010 segs_in:162410 
data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308 
lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps 
delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242 
rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168

This would indicate that the buffer size is not growing while in this
state, and is therefore limiting the rcv_wnd?
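
To compare these values across several captures, the `skmem` block and the
neighbouring counters can be pulled out of the `ss -tim` text. The following
is only a minimal sketch: the helper names are made up for illustration, and
the meaning of the field letters (rb = receive buffer, d = drops, etc.) is
assumed to be as described in the ss(8) man page.

```python
import re

# Field letters as described for "skmem" in ss(8); the long names are
# labels chosen for this sketch.
SKMEM_FIELDS = {
    "r": "rmem_alloc", "rb": "rcv_buf", "t": "wmem_alloc", "tb": "snd_buf",
    "f": "fwd_alloc", "w": "wmem_queued", "o": "opt_mem", "bl": "back_log",
    "d": "sock_drop",
}


def parse_skmem(ss_text):
    """Return the skmem:(...) values of one ss entry as a dict of ints."""
    match = re.search(r"skmem:\(([^)]*)\)", ss_text)
    if match is None:
        return {}
    values = {}
    for item in match.group(1).split(","):
        key, number = re.fullmatch(r"([a-z]+)(\d+)", item.strip()).groups()
        values[SKMEM_FIELDS.get(key, key)] = int(number)
    return values


def parse_counter(ss_text, name):
    """Return the integer after 'name:' in an ss entry, or None if absent."""
    match = re.search(r"\b" + re.escape(name) + r":(\d+)", ss_text)
    return int(match.group(1)) if match else None


# Excerpt of the entry shown above, joined onto one logical line.
SAMPLE = (
    "skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic "
    "wscale:10,10 rcvmss:7168 rcv_ssthresh:903417 rcv_wnd:7168"
)

mem = parse_skmem(SAMPLE)
print("rcv_buf:", mem["rcv_buf"], "drops:", mem["sock_drop"])
print("rcv_wnd:", parse_counter(SAMPLE, "rcv_wnd"))
print("rcv_ssthresh:", parse_counter(SAMPLE, "rcv_ssthresh"))
```

For the entry above this gives a receive buffer of 7488 bytes next to an
rcv_wnd of 7168 bytes, while rcv_ssthresh is still 903417.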

Best regards,
Christian Ebner



* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-18  9:31     ` Christian Ebner
@ 2025-12-18 10:10       ` Eric Dumazet
  2025-12-18 12:28         ` Christian Ebner
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-12-18 10:10 UTC (permalink / raw)
  To: Christian Ebner
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

On Thu, Dec 18, 2025 at 10:31 AM Christian Ebner <c.ebner@proxmox.com> wrote:
>
> Hi,
> to add some more information gained.
>
> pcaps obtained via tcpdump of the traffic while in a stale state show
> the following recurring pattern:
>
> 41      0.705618        10.xx.xx.aa     10.xx.xx.bb     TCP     66      [TCP ZeroWindow] 8007 → 55554
> [ACK] Seq=1 Ack=28673 Win=0 Len=0 TSval=2656874280 TSecr=1348075902
> 42      0.705662        10.xx.xx.aa     10.xx.xx.bb     TCP     66      [TCP Window Update] 8007 →
> 55554 [ACK] Seq=1 Ack=28673 Win=7 Len=0 TSval=2656874280 TSecr=1348075902
> 90      0.914606        10.xx.xx.bb     10.xx.xx.aa     TCP     7234    55554 → 8007 [PSH, ACK]
> Seq=28673 Ack=1 Win=139 Len=7168 TSval=1348076111 TSecr=2656874280
>
> Output of `ss -tim` show the sockets being severely limited in buffer size:
>
> ESTAB                          0                               0
>
> [::ffff:10.xx.xx.aa]:8007
>        [::ffff:10.xx.xx.bb]:55554
>           skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
> wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
> rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
> bytes_received:1295747055 segs_out:301010 segs_in:162410
> data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
> lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
> delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
> rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
>
> This would indicate that the buffer size not growing while in this
> state, therefore limiting the rcv_wnd?

Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem

It seems your application is enforcing a small SO_RCVBUF ?


I would take a look at

ecfea98b7d0d tcp: add net.ipv4.tcp_rcvbuf_low_rtt
416dd649f3aa tcp: add net.ipv4.tcp_comp_sack_rtt_percent
aa251c84636c tcp: fix too slow tcp_rcvbuf_grow() action

After applying these patches, you can run this on the receiver:

perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script


* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-18 10:10       ` Eric Dumazet
@ 2025-12-18 12:28         ` Christian Ebner
  2025-12-18 13:19           ` Eric Dumazet
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Ebner @ 2025-12-18 12:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

Hi Eric,

thank you for your reply!

On 12/18/25 11:10 AM, Eric Dumazet wrote:
> Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem

Affected users report they have the respective kernel defaults set, so:
- "4096 131072 6291456"  for v6.17 builds
- "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
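
As a quick cross-check, a minimal sketch that splits these triples into the
min/default/max values described in the kernel's ip-sysctl documentation and
compares them with the `rb` values reported by `ss` (7488 here, 4352 in a
later report). The class and function names are made up for illustration.

```python
from typing import NamedTuple


class TcpRmem(NamedTuple):
    """The three tcp_rmem values, in bytes."""
    min_bytes: int
    default_bytes: int
    max_bytes: int


def parse_tcp_rmem(text):
    """Parse the contents of /proc/sys/net/ipv4/tcp_rmem."""
    low, default, high = (int(value) for value in text.split())
    return TcpRmem(low, default, high)


def read_tcp_rmem(path="/proc/sys/net/ipv4/tcp_rmem"):
    """Read the live setting on a Linux host."""
    with open(path) as handle:
        return parse_tcp_rmem(handle.read())


v617 = parse_tcp_rmem("4096 131072 6291456")
v618 = parse_tcp_rmem("4096 131072 33554432")

print("v6.17 max (MiB):", v617.max_bytes // 2**20)
print("v6.18 max (MiB):", v618.max_bytes // 2**20)

# rb values seen on stalled sockets, compared with the default.
for rb in (7488, 4352):
    print(rb, "below default:", rb < v618.default_bytes,
          "above min:", rb > v618.min_bytes)
```

Both stalled sockets sit just above tcp_rmem[0] and far below tcp_rmem[1].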

> It seems your application is enforcing a small SO_RCVBUF ?

No, we can exclude that, since the output of `ss -tim` shows the default
buffer size after the connection is established, growing up to the max
value during traffic (backups being performed).

Might out-of-order packets and small (us scale) RTTs play a role?
`ss` reports non-zero `rcv_ooopack` for connections in the stale state, and
the great majority of affected users run MTU 9000 (the default MTU seems to
reduce the likelihood of this happening as well).

> I would take a look at
> 
> ecfea98b7d0d tcp: add net.ipv4.tcp_rcvbuf_low_rtt
> 416dd649f3aa tcp: add net.ipv4.tcp_comp_sack_rtt_percent
> aa251c84636c tcp: fix too slow tcp_rcvbuf_grow() action

Thanks a lot for the hints, we did already provide a test build with 
commit aa251c84636c cherry-picked on top of 6.17.11 to affected users, 
but they were still running into stale connections.
So while this (and most likely the increased `tcp_rmem[2]` default) 
seems to reduce the likelihood of stalls occurring, it does not fix them.

> After applying these patches, you can on the receiver :
> 
> perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script

We now provided test builds with mentioned commits cherry-picked as well 
and further asked for users to test with v6.18.1 stable.

Let me get back to you with requested traces and test results.

Best regards,
Christian Ebner


* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-18 12:28         ` Christian Ebner
@ 2025-12-18 13:19           ` Eric Dumazet
  2025-12-18 14:58             ` Christian Ebner
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-12-18 13:19 UTC (permalink / raw)
  To: Christian Ebner
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

On Thu, Dec 18, 2025 at 1:28 PM Christian Ebner <c.ebner@proxmox.com> wrote:
>
> Hi Eric,
>
> thank you for your reply!
>
> On 12/18/25 11:10 AM, Eric Dumazet wrote:
> > Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem
>
> Affected users report they have the respective kernels defaults set, so:
> - "4096 131072 6291456"  for v6.17 builds
> - "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
>
> > It seems your application is enforcing a small SO_RCVBUF ?
>
> No, we can exclude that since the output of `ss -tim` show the default
> buffer size after connection being established and growing up to the max
> value during traffic (backups being performed).
>

The trace you provided seems to show a very different picture ?

[::ffff:10.xx.xx.aa]:8007
       [::ffff:10.xx.xx.bb]:55554
          skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
bytes_received:1295747055 segs_out:301010 segs_in:162410
data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168

rb7488 would suggest the application has played with a very small SO_RCVBUF,
or some memory allocation constraint (memcg ?)

> Might out-of-order packets and small (us scale) RTTs play a role?
> `ss` reports `rcv_ooopack` when stale, the great majority of users
> having MTU 9000 (default seems to reduce the likelihood of this
> happening as well).
>
> > I would take a look at
> >
> > ecfea98b7d0d tcp: add net.ipv4.tcp_rcvbuf_low_rtt
> > 416dd649f3aa tcp: add net.ipv4.tcp_comp_sack_rtt_percent
> > aa251c84636c tcp: fix too slow tcp_rcvbuf_grow() action
>
> Thanks a lot for the hints, we did already provide a test build with
> commit aa251c84636c cherry-picked on top of 6.17.11 to affected users,
> but they were still running into stale connections.
> So while this (and most likely the increased `tcp_rmem[2]` default)
> seems to reduce the likelihood of stalls occurring, it does not fix them.
>
> > After applying these patches, you can on the receiver :
> >
> > perf record -a -e tcp:tcp_rcvbuf_grow sleep 30 ; perf script
>
> We now provided test builds with mentioned commits cherry-picked as well
> and further asked for users to test with v6.18.1 stable.
>
> Let me get back to you with requested traces and test results.
>
> Best regards,
> Christian Ebner
>


* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-18 13:19           ` Eric Dumazet
@ 2025-12-18 14:58             ` Christian Ebner
  2025-12-19  8:23               ` Eric Dumazet
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Ebner @ 2025-12-18 14:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

On 12/18/25 2:19 PM, Eric Dumazet wrote:
> On Thu, Dec 18, 2025 at 1:28 PM Christian Ebner <c.ebner@proxmox.com> wrote:
>>
>> Hi Eric,
>>
>> thank you for your reply!
>>
>> On 12/18/25 11:10 AM, Eric Dumazet wrote:
>>> Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem
>>
>> Affected users report they have the respective kernels defaults set, so:
>> - "4096 131072 6291456"  for v6.17 builds
>> - "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
>>
>>> It seems your application is enforcing a small SO_RCVBUF ?
>>
>> No, we can exclude that since the output of `ss -tim` show the default
>> buffer size after connection being established and growing up to the max
>> value during traffic (backups being performed).
>>
> 
> The trace you provided seems to show a very different picture ?
> 
> [::ffff:10.xx.xx.aa]:8007
>         [::ffff:10.xx.xx.bb]:55554
>            skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
> wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
> rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
> bytes_received:1295747055 segs_out:301010 segs_in:162410
> data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
> lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
> delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
> rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
> 
> rb7488 would suggest the application has played with a very small SO_RCVBUF,
> or some memory allocation constraint (memcg ?)

Thanks for the hint where to look, however we checked that the process is 
not memory constrained and the host has no memory pressure.

Also `strace -f -e socket,setsockopt -p $(pidof proxmox-backup-proxy)` 
shows no syscalls which would change the socket buffer size (though this 
still needs to be double checked by affected users for completeness).

Further, the stalls most often happen mid-transfer: the connection starts
at the expected throughput, stalls, and may even recover after some time
and continue at regular speed again.


Status update for v6.18
-----------------------

In the meantime, a user reported 2 stale connections with running kernel 
6.18+416dd649f3aa

The tcpdump pattern looks slightly different, here we got repeating 
sequences of:
```
224  5.407981  10.xx.xx.bb  10.xx.xx.aa  TCP  4162  40068 → 8007 [PSH, ACK] Seq=106497 Ack=1 Win=3121 Len=4096 TSval=3198115973 TSecr=3048094015
225  5.408064  10.xx.xx.aa  10.xx.xx.bb  TCP  66    8007 → 40068 [ACK] Seq=1 Ack=110593 Win=4 Len=0 TSval=3048094223 TSecr=3198115973
```

The perf trace for `tcp:tcp_rcvbuf_grow` came back empty while in stale 
state, tracing with:
```
perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
perf script
```
produced some output as shown below, so it seems that tcp_rcvbuf_grow() 
is never called in that case, while tcp_rcv_space_adjust() is.

```
  tokio-runtime-w    4930 [002]  6094.017275: tcp:tcp_rcv_space_adjust: 
family=AF_INET6 sport=8007 dport=40068 saddr=10.xx.xx.aa 
daddr=10.xx.xx.bb saddrv6=::ffff:10.xx.xx.aa daddrv6=::ffff:10.xx.xx.bb 
sock_cookie=101a
  tokio-runtime-w    4930 [002]  6094.187083: tcp:tcp_rcv_space_adjust: 
family=AF_INET6 sport=8007 dport=49944 saddr=10.xx.xx.aa 
daddr=10.xx.xx.bb saddrv6=::ffff:10.xx.xx.aa daddrv6=::ffff:10.xx.xx.bb 
sock_cookie=2
```

ss -tim
```
ESTAB 0      0      [::ffff:10.xx.xx.aa]:8007 [::ffff:10.xx.xx.bb]:40068
          skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d199) cubic 
wscale:7,10 rto:201 rtt:0.093/0.025 ato:40 mss:8948 pmtu:9000 
rcvmss:4096 advmss:8948 cwnd:10 ssthresh:16 bytes_sent:451949 
bytes_acked:451949 bytes_received:805775577 segs_out:59050 segs_in:72440 
data_segs_out:392 data_segs_in:72287 send 7.7Gbps lastsnd:75880 
lastrcv:167 lastack:167 pacing_rate 9.16Gbps delivery_rate 2.09Gbps 
delivered:393 app_limited busy:756ms rcv_rtt:207.343 rcv_space:107600 
rcv_ssthresh:312270 minrtt:0.055 rcv_ooopack:287 snd_wnd:399488 rcv_wnd:4096
ESTAB 0      0      [::ffff:10.xx.xx.aa]:8007 [::ffff:10.xx.xx.bb]:49944
          skmem:(r0,rb4352,t0,tb332800,f0,w0,o0,bl0,d286) cubic 
wscale:7,10 rto:201 rtt:0.213/0.266 ato:40 mss:8948 pmtu:9000 
rcvmss:4096 advmss:8948 cwnd:10 ssthresh:17 bytes_sent:1255369 
bytes_acked:1255369 bytes_received:55175665 segs_out:11516 segs_in:8473 
data_segs_out:354 data_segs_in:8038 send 3.36Gbps lastsnd:111496 
lastrcv:14 lastack:14 pacing_rate 4.03Gbps delivery_rate 2.42Gbps 
delivered:355 busy:103ms rcv_rtt:207.596 rcv_space:79779 
rcv_ssthresh:198722 minrtt:0.07 rcv_ooopack:6 snd_wnd:439552 rcv_wnd:4096
```
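
The `Win=` values in the captures line up with the `ss` output once window
scaling is applied. A minimal sketch of the arithmetic, assuming the `Win=`
fields are the raw (unscaled) header values and that `wscale:A,B` in `ss`
lists the shift for the peer's advertised window first and the shift for our
own advertised window second (the numbers below are consistent with that
reading). The function name is made up for illustration.

```python
def scaled_window(raw_win, shift):
    """Apply a TCP window scale shift to a raw 16-bit window field."""
    return raw_win << shift


# Second capture: wscale:7,10 on the receiver side (port 8007).
print(scaled_window(4, 10))     # 4096, matches rcv_wnd:4096 and Len=4096
print(scaled_window(3121, 7))   # 399488, matches snd_wnd:399488

# First capture: wscale:10,10.
print(scaled_window(7, 10))     # 7168, matches rcv_wnd:7168 and Len=7168
print(scaled_window(139, 10))   # 142336, matches snd_wnd:142336
```

So in both cases the receiver advertises a window of exactly one segment of
the size the sender then transmits (4096 or 7168 bytes).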

Best regards,
Christian Ebner



* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-18 14:58             ` Christian Ebner
@ 2025-12-19  8:23               ` Eric Dumazet
  2025-12-19  8:45                 ` Eric Dumazet
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-12-19  8:23 UTC (permalink / raw)
  To: Christian Ebner
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

On Thu, Dec 18, 2025 at 3:58 PM Christian Ebner <c.ebner@proxmox.com> wrote:
>
> On 12/18/25 2:19 PM, Eric Dumazet wrote:
> > On Thu, Dec 18, 2025 at 1:28 PM Christian Ebner <c.ebner@proxmox.com> wrote:
> >>
> >> Hi Eric,
> >>
> >> thank you for your reply!
> >>
> >> On 12/18/25 11:10 AM, Eric Dumazet wrote:
> >>> Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem
> >>
> >> Affected users report they have the respective kernels defaults set, so:
> >> - "4096 131072 6291456"  for v6.17 builds
> >> - "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
> >>
> >>> It seems your application is enforcing a small SO_RCVBUF ?
> >>
> >> No, we can exclude that since the output of `ss -tim` show the default
> >> buffer size after connection being established and growing up to the max
> >> value during traffic (backups being performed).
> >>
> >
> > The trace you provided seems to show a very different picture ?
> >
> > [::ffff:10.xx.xx.aa]:8007
> >         [::ffff:10.xx.xx.bb]:55554
> >            skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
> > wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
> > rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
> > bytes_received:1295747055 segs_out:301010 segs_in:162410
> > data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
> > lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
> > delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
> > rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
> >
> > rb7488 would suggest the application has played with a very small SO_RCVBUF,
> > or some memory allocation constraint (memcg ?)
>
> Thanks for the hint were to look, however we checked that the process is
> not memory constrained and the host has no memory pressure.
>
> Also `strace -f -e socket,setsockopt -p $(pidof proxmox-backup-proxy)`
> shows no syscalls which would change the socket buffer size (though this
> still needs to be double checked by affected users for completeness).
>
> Further, the stalls most often happen mid transfer, starting with the
> expected throughput and even might recover from the stall after some
> time, continue at regular speed again.
>
>
> Status update for v6.18
> -----------------------
>
> In the meantime, a user reported 2 stale connections with running kernel
> 6.18+416dd649f3aa
>
> The tcpdump pattern looks slightly different, here we got repeating
> sequences of:
> ```
> 224     5.407981        10.xx.xx.bb     10.xx.xx.aa     TCP     4162    40068 → 8007 [PSH, ACK]
> Seq=106497 Ack=1 Win=3121 Len=4096 TSval=3198115973 TSecr=3048094015
> 225     5.408064        10.xx.xx.aa     10.xx.xx.bb     TCP     66      8007 → 40068 [ACK] Seq=1
> Ack=110593 Win=4 Len=0 TSval=3048094223 TSecr=3198115973
> ```
>
> The perf trace for `tcp:tcp_rcvbuf_grow` came back empty while in stale
> state, tracing with:
> ```
> perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
> perf script
> ```
> produced some output as shown below, so it seems that tcp_rcvbuf_grow()
> is never called in that case, while tcp_rcv_space_adjust() is.

Autotuning is not enabled for your case, somehow the application is
not behaving as expected,
so maybe you have to change tcp_rmem[2] if a driver is allocating
order-2 pages for the 9K frames.

You have not said what is on the sender side (Linux or another stack?)


* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-19  8:23               ` Eric Dumazet
@ 2025-12-19  8:45                 ` Eric Dumazet
  2025-12-19 10:00                   ` Christian Ebner
  0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2025-12-19  8:45 UTC (permalink / raw)
  To: Christian Ebner
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

On Fri, Dec 19, 2025 at 9:23 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Dec 18, 2025 at 3:58 PM Christian Ebner <c.ebner@proxmox.com> wrote:
> >
> > On 12/18/25 2:19 PM, Eric Dumazet wrote:
> > > On Thu, Dec 18, 2025 at 1:28 PM Christian Ebner <c.ebner@proxmox.com> wrote:
> > >>
> > >> Hi Eric,
> > >>
> > >> thank you for your reply!
> > >>
> > >> On 12/18/25 11:10 AM, Eric Dumazet wrote:
> > >>> Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem
> > >>
> > >> Affected users report they have the respective kernels defaults set, so:
> > >> - "4096 131072 6291456"  for v6.17 builds
> > >> - "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
> > >>
> > >>> It seems your application is enforcing a small SO_RCVBUF ?
> > >>
> > >> No, we can exclude that since the output of `ss -tim` show the default
> > >> buffer size after connection being established and growing up to the max
> > >> value during traffic (backups being performed).
> > >>
> > >
> > > The trace you provided seems to show a very different picture ?
> > >
> > > [::ffff:10.xx.xx.aa]:8007
> > >         [::ffff:10.xx.xx.bb]:55554
> > >            skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
> > > wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
> > > rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
> > > bytes_received:1295747055 segs_out:301010 segs_in:162410
> > > data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
> > > lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
> > > delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
> > > rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
> > >
> > > rb7488 would suggest the application has played with a very small SO_RCVBUF,
> > > or some memory allocation constraint (memcg ?)
> >
> > Thanks for the hint were to look, however we checked that the process is
> > not memory constrained and the host has no memory pressure.
> >
> > Also `strace -f -e socket,setsockopt -p $(pidof proxmox-backup-proxy)`
> > shows no syscalls which would change the socket buffer size (though this
> > still needs to be double checked by affected users for completeness).
> >
> > Further, the stalls most often happen mid transfer, starting with the
> > expected throughput and even might recover from the stall after some
> > time, continue at regular speed again.
> >
> >
> > Status update for v6.18
> > -----------------------
> >
> > In the meantime, a user reported 2 stale connections with running kernel
> > 6.18+416dd649f3aa
> >
> > The tcpdump pattern looks slightly different, here we got repeating
> > sequences of:
> > ```
> > 224     5.407981        10.xx.xx.bb     10.xx.xx.aa     TCP     4162    40068 → 8007 [PSH, ACK]
> > Seq=106497 Ack=1 Win=3121 Len=4096 TSval=3198115973 TSecr=3048094015
> > 225     5.408064        10.xx.xx.aa     10.xx.xx.bb     TCP     66      8007 → 40068 [ACK] Seq=1
> > Ack=110593 Win=4 Len=0 TSval=3048094223 TSecr=3198115973
> > ```
> >
> > The perf trace for `tcp:tcp_rcvbuf_grow` came back empty while in stale
> > state, tracing with:
> > ```
> > perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
> > perf script
> > ```
> > produced some output as shown below, so it seems that tcp_rcvbuf_grow()
> > is never called in that case, while tcp_rcv_space_adjust() is.
>
> Autotuning is not enabled for your case, somehow the application is
> not behaving as expected,
> so maybe you have to change tcp_rmem[2] if a driver is allocating
> order-2 pages for the 9K frames.

I meant to say : change tcp_rmem[1]

echo "4096 262144 33554432" >/proc/sys/net/ipv4/tcp_rmem

>
> You have not given what  was on the sender side (linux or other stack ?)


* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-19  8:45                 ` Eric Dumazet
@ 2025-12-19 10:00                   ` Christian Ebner
  2025-12-19 10:12                     ` Eric Dumazet
  0 siblings, 1 reply; 40+ messages in thread
From: Christian Ebner @ 2025-12-19 10:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

On 12/19/25 9:45 AM, Eric Dumazet wrote:
> On Fri, Dec 19, 2025 at 9:23 AM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Thu, Dec 18, 2025 at 3:58 PM Christian Ebner <c.ebner@proxmox.com> wrote:
>>>
>>> On 12/18/25 2:19 PM, Eric Dumazet wrote:
>>>> On Thu, Dec 18, 2025 at 1:28 PM Christian Ebner <c.ebner@proxmox.com> wrote:
>>>>>
>>>>> Hi Eric,
>>>>>
>>>>> thank you for your reply!
>>>>>
>>>>> On 12/18/25 11:10 AM, Eric Dumazet wrote:
>>>>>> Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem
>>>>>
>>>>> Affected users report they have the respective kernels defaults set, so:
>>>>> - "4096 131072 6291456"  for v6.17 builds
>>>>> - "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
>>>>>
>>>>>> It seems your application is enforcing a small SO_RCVBUF ?
>>>>>
>>>>> No, we can exclude that since the output of `ss -tim` show the default
>>>>> buffer size after connection being established and growing up to the max
>>>>> value during traffic (backups being performed).
>>>>>
>>>>
>>>> The trace you provided seems to show a very different picture ?
>>>>
>>>> [::ffff:10.xx.xx.aa]:8007
>>>>          [::ffff:10.xx.xx.bb]:55554
>>>>             skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
>>>> wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
>>>> rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
>>>> bytes_received:1295747055 segs_out:301010 segs_in:162410
>>>> data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
>>>> lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
>>>> delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
>>>> rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
>>>>
>>>> rb7488 would suggest the application has played with a very small SO_RCVBUF,
>>>> or some memory allocation constraint (memcg ?)
>>>
>>> Thanks for the hint were to look, however we checked that the process is
>>> not memory constrained and the host has no memory pressure.
>>>
>>> Also `strace -f -e socket,setsockopt -p $(pidof proxmox-backup-proxy)`
>>> shows no syscalls which would change the socket buffer size (though this
>>> still needs to be double checked by affected users for completeness).
>>>
>>> Further, the stalls most often happen mid transfer: the connection starts
>>> with the expected throughput and might even recover from the stall after
>>> some time, continuing at regular speed again.
>>>
>>>
>>> Status update for v6.18
>>> -----------------------
>>>
>>> In the meantime, a user reported 2 stale connections with running kernel
>>> 6.18+416dd649f3aa
>>>
>>> The tcpdump pattern looks slightly different, here we got repeating
>>> sequences of:
>>> ```
>>> 224     5.407981        10.xx.xx.bb     10.xx.xx.aa     TCP     4162    40068 → 8007 [PSH, ACK]
>>> Seq=106497 Ack=1 Win=3121 Len=4096 TSval=3198115973 TSecr=3048094015
>>> 225     5.408064        10.xx.xx.aa     10.xx.xx.bb     TCP     66      8007 → 40068 [ACK] Seq=1
>>> Ack=110593 Win=4 Len=0 TSval=3048094223 TSecr=3198115973
>>> ```
>>>
>>> The perf trace for `tcp:tcp_rcvbuf_grow` came back empty while in stale
>>> state, tracing with:
>>> ```
>>> perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
>>> perf script
>>> ```
>>> produced some output as shown below, so it seems that tcp_rcvbuf_grow()
>>> is never called in that case, while tcp_rcv_space_adjust() is.
>>
>> Autotuning is not enabled for your case, somehow the application is
>> not behaving as expected,

Is there a way for us to check if autotuning is enabled for the TCP 
connection at this point in time? Some tracepoint to identify it being 
deactivated?

>> so maybe you have to change tcp_rmem[2] if a driver is allocating
>> order-2 pages for the 9K frames.

Same here, is there a way for us to check this? Note however that we 
could not identify a specific NIC/driver to cause the behavior, it 
appears for various vendor models.

> 
> I meant to say : change tcp_rmem[1]
> 
> echo "4096 262144 33554432" >/proc/sys/net/ipv4/tcp_rmem

Okay, thanks for the suggestion, let me get back to you with results if 
this changes anything.


>> You have not given what  was on the sender side (linux or other stack ?)

Clients are all Linux hosts, running kernel versions 6.8, 6.14 or 6.17. 
No other TCP stacks.

Best regards,
Christian Ebner



* Re: [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-12-19 10:00                   ` Christian Ebner
@ 2025-12-19 10:12                     ` Eric Dumazet
  0 siblings, 0 replies; 40+ messages in thread
From: Eric Dumazet @ 2025-12-19 10:12 UTC (permalink / raw)
  To: Christian Ebner
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, lkolbe

On Fri, Dec 19, 2025 at 11:00 AM Christian Ebner <c.ebner@proxmox.com> wrote:
>
> On 12/19/25 9:45 AM, Eric Dumazet wrote:
> > On Fri, Dec 19, 2025 at 9:23 AM Eric Dumazet <edumazet@google.com> wrote:
> >>
> >> On Thu, Dec 18, 2025 at 3:58 PM Christian Ebner <c.ebner@proxmox.com> wrote:
> >>>
> >>> On 12/18/25 2:19 PM, Eric Dumazet wrote:
> >>>> On Thu, Dec 18, 2025 at 1:28 PM Christian Ebner <c.ebner@proxmox.com> wrote:
> >>>>>
> >>>>> Hi Eric,
> >>>>>
> >>>>> thank you for your reply!
> >>>>>
> >>>>> On 12/18/25 11:10 AM, Eric Dumazet wrote:
> >>>>>> Can you give us (on receive side) : cat /proc/sys/net/ipv4/tcp_rmem
> >>>>>
> >>>>> Affected users report they have the respective kernels defaults set, so:
> >>>>> - "4096 131072 6291456"  for v6.17 builds
> >>>>> - "4096 131072 33554432" with the bumped max value of 32M for v6.18 builds
> >>>>>
> >>>>>> It seems your application is enforcing a small SO_RCVBUF ?
> >>>>>
> >>>>> No, we can exclude that since the output of `ss -tim` shows the default
> >>>>> buffer size after the connection is established, growing up to the max
> >>>>> value during traffic (backups being performed).
> >>>>>
> >>>>
> >>>> The trace you provided seems to show a very different picture ?
> >>>>
> >>>> [::ffff:10.xx.xx.aa]:8007
> >>>>          [::ffff:10.xx.xx.bb]:55554
> >>>>             skmem:(r0,rb7488,t0,tb332800,f0,w0,o0,bl0,d20) cubic
> >>>> wscale:10,10 rto:201 rtt:0.085/0.015 ato:40 mss:8948 pmtu:9000
> >>>> rcvmss:7168 advmss:8948 cwnd:10 bytes_sent:937478 bytes_acked:937478
> >>>> bytes_received:1295747055 segs_out:301010 segs_in:162410
> >>>> data_segs_out:1035 data_segs_in:161588 send 8.42Gbps lastsnd:3308
> >>>> lastrcv:191 lastack:191 pacing_rate 16.7Gbps delivery_rate 2.74Gbps
> >>>> delivered:1036 app_limited busy:437ms rcv_rtt:207.551 rcv_space:96242
> >>>> rcv_ssthresh:903417 minrtt:0.049 rcv_ooopack:23 snd_wnd:142336 rcv_wnd:7168
> >>>>
> >>>> rb7488 would suggest the application has played with a very small SO_RCVBUF,
> >>>> or some memory allocation constraint (memcg ?)
> >>>
> >>> Thanks for the hint where to look, however we checked that the process is
> >>> not memory constrained and the host has no memory pressure.
> >>>
> >>> Also `strace -f -e socket,setsockopt -p $(pidof proxmox-backup-proxy)`
> >>> shows no syscalls which would change the socket buffer size (though this
> >>> still needs to be double checked by affected users for completeness).
> >>>
> >>> Further, the stalls most often happen mid transfer: the connection starts
> >>> with the expected throughput and might even recover from the stall after
> >>> some time, continuing at regular speed again.
> >>>
> >>>
> >>> Status update for v6.18
> >>> -----------------------
> >>>
> >>> In the meantime, a user reported 2 stale connections with running kernel
> >>> 6.18+416dd649f3aa
> >>>
> >>> The tcpdump pattern looks slightly different, here we got repeating
> >>> sequences of:
> >>> ```
> >>> 224     5.407981        10.xx.xx.bb     10.xx.xx.aa     TCP     4162    40068 → 8007 [PSH, ACK]
> >>> Seq=106497 Ack=1 Win=3121 Len=4096 TSval=3198115973 TSecr=3048094015
> >>> 225     5.408064        10.xx.xx.aa     10.xx.xx.bb     TCP     66      8007 → 40068 [ACK] Seq=1
> >>> Ack=110593 Win=4 Len=0 TSval=3048094223 TSecr=3198115973
> >>> ```
> >>>
> >>> The perf trace for `tcp:tcp_rcvbuf_grow` came back empty while in stale
> >>> state, tracing with:
> >>> ```
> >>> perf record -a -e tcp:tcp_rcv_space_adjust,tcp:tcp_rcvbuf_grow
> >>> perf script
> >>> ```
> >>> produced some output as shown below, so it seems that tcp_rcvbuf_grow()
> >>> is never called in that case, while tcp_rcv_space_adjust() is.
> >>
> >> Autotuning is not enabled for your case, somehow the application is
> >> not behaving as expected,
>
> Is there a way for us to check if autotuning is enabled for the TCP
> connection at this point in time? Some tracepoint to identify it being
> deactivated?

tcp_rcv_space_adjust() has a tracepoint.

You can also use bpftrace to collect more fields from TCP sockets.

If trace_tcp_rcvbuf_grow() is not called, then either the application
drains its receive queue too slowly for autotuning to kick in, or the
sender is limited.


>
> >> so maybe you have to change tcp_rmem[2] if a driver is allocating
> >> order-2 pages for the 9K frames.
>
> Same here, is there a way for us to check this? Note however that we
> could not identify a specific NIC/driver to cause the behavior, it
> appears for various vendor models.

I don't have this issue using regular tcp_stream tests and 9K traffic.
Can you try standard programs instead of in-house ones ?
(netperf, neper, iperf3...)

Use a bpftrace program to gather tp->scaling_ratio

bpftrace -e '
k:tcp_rcv_space_adjust {
  $sk = (struct sock *)arg0;
  if ($sk->sk_rcvbuf > 20000) { return ; }
  $tp = (struct tcp_sock *)arg0;
  @scaling[$tp->scaling_ratio] = count();
}
'


>
> >
> > I meant to say : change tcp_rmem[1]
> >
> > echo "4096 262144 33554432" >/proc/sys/net/ipv4/tcp_rmem
>
> Okay, thanks for the suggestion, let me get back to you with results if
> this changes anything.
>
>
> >> You have not given what  was on the sender side (linux or other stack ?)
>
> Clients are all Linux hosts, running kernel versions 6.8, 6.14 or 6.17.
> No other TCP stacks.
>
> Best regards,
> Christian Ebner
>


* Re: [regression] [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks
  2025-07-11 11:40 ` [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks Eric Dumazet
  2025-07-12 21:54   ` Kuniyuki Iwashima
  2025-12-15 10:19   ` Christian Ebner
@ 2026-01-25 21:11   ` Simon Baatz
  2 siblings, 0 replies; 40+ messages in thread
From: Simon Baatz @ 2026-01-25 21:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Neal Cardwell,
	Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, c.ebner

Hi,

I am seeing a regression in the Valkey test suite with kernels >=
6.17. A bisection points to 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf
checks"), but my impression is that this change mostly makes an
underlying issue surface earlier. Additionally, 9ca48d616e ("tcp: do
not accept packets beyond window") seems to make the problem even
worse.

Valkey test scenario:

Test client opens a connection and sends a few MB worth of Valkey
commands (one write per command). The client does not perform any
reads until all commands have been sent.

Valkey server accepts the connection, enables TCP_NODELAY, and uses
non-blocking I/O to read/write. It writes a response for each command
and buffers data internally when required.

The connection runs over the loopback interface with MTU 65536. In
most cases this connection ends up stuck. The system is otherwise idle
and has plenty of free memory.

I have a small reproducer where the client continuously sends
commands and never reads. The server sends 127 bytes after reading
data (this works for any payload < 2^wscale; 127 fills the buffer
fastest).
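
To illustrate why payloads below 2^wscale matter, here is a rough model of
the "never shrink the window" rule described in the Cloudflare post linked
further down. It is a sketch, not kernel code: the function names and the
initial window of 65536 are assumptions, and it assumes free buffer space is
always smaller than the current window, so the receiver rounds the remaining
window up to the next multiple of 2^wscale on every ACK.

```python
def align_up(value, unit):
    """Round value up to the next multiple of unit."""
    return -(-value // unit) * unit


def simulate(payload, wscale, segments, initial_window=65536):
    """Return (right_edge, window) after `segments` unread segments arrive.

    Assumption: the application never reads and free space is always below
    the current window, so the "do not shrink" rounding applies every time.
    """
    unit = 1 << wscale
    rcv_nxt = 0
    right_edge = initial_window
    window = initial_window
    for _ in range(segments):
        rcv_nxt += payload
        # Window still promised to the peer after this segment.
        cur_win = max(right_edge - rcv_nxt, 0)
        # The window field counts units of 2^wscale, so round up, not down.
        window = align_up(cur_win, unit)
        right_edge = rcv_nxt + window
    return right_edge, window


if __name__ == "__main__":
    for payload in (127, 128):
        edge, win = simulate(payload, wscale=7, segments=10)
        print(f"payload={payload}: right_edge={edge} window={win}")
```

In this model, ten 127-byte segments move the right edge from 65536 to 66806
while the advertised window stays at 65536, so no buffer space is ever
reclaimed on paper. With 128-byte segments the right edge stays at 65536 and
the window shrinks by 128 per segment.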

Trigger warning: reproducer code generated using AI:
https://gist.github.com/gmbnomis/0b75b2b88f49dc33d6c38ac23120b1e3

Here is a run on 6.19.0-rc6 using virtme-ng (server on port 7000, wscale is 7):

    1   0.000000    127.0.0.1 → 127.0.0.1    TCP 74 37532 → 7000 [SYN] Seq=0 Win=65495 Len=0 MSS=65495 SACK_PERM TSval=4155167414 TSecr=0 WS=128
    2   0.000080    127.0.0.1 → 127.0.0.1    TCP 74 7000 → 37532 [SYN, ACK] Seq=0 Ack=1 Win=65483 Len=0 MSS=65495 SACK_PERM TSval=4155167415 TSecr=4155167414 WS=128

[...]

At this point the client is still advertising a large receive window
although we run out of receive buffer space. This happens because
window scaling is used and we must not shrink the window (see
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/).
Before 9ca48d616e ("tcp: do not accept packets beyond window"), the
stack would accept significantly more (up to max rmem, on 6.16 this
happens after approx 5MB with standard rmem buffer settings):
 1510   0.021964    127.0.0.1 → 127.0.0.1    TCP 226 37532 → 7000 [PSH, ACK] Seq=152225 Ack=95505 Win=65536 Len=160 TSval=4155167436 TSecr=4155167436
 1511   0.021977    127.0.0.1 → 127.0.0.1    TCP 193 7000 → 37532 [PSH, ACK] Seq=95505 Ack=152385 Win=191104 Len=127 TSval=4155167436 TSecr=4155167436
 
Out of memory: drop packet #1511. Since e2142825c120 ("net: tcp: send
zero-window ACK when no memory") the adv. window is set to zero. Since
8c670bdfa58e ("tcp: correct handling of extreme memory squeeze") the
right edge of the window actually moves to the left (to 95505):
 1512   0.021987    127.0.0.1 → 127.0.0.1    TCP 226 [TCP ZeroWindow] 37532 → 7000 [PSH, ACK] Seq=152385 Ack=95505 Win=0 Len=160 TSval=4155167436 TSecr=4155167436
 1513   0.029462    127.0.0.1 → 127.0.0.1    TCP 65549 [TCP ZeroWindow] 37532 → 7000 [ACK] Seq=152545 Ack=95505 Win=0 Len=65483 TSval=4155167444 TSecr=4155167436
 
The server has already sent data up to 95632, and continues with that
Seq, but from the client's point of view this is now beyond the
receive window:
 1514   0.029847    127.0.0.1 → 127.0.0.1    TCP 66 7000 → 37532 [ACK] Seq=95632 Ack=218028 Win=191104 Len=0 TSval=4155167444 TSecr=4155167436
 
Send an ack for the ack #1514 because it is considered outside of the window:
 1515   0.029856    127.0.0.1 → 127.0.0.1    TCP 66 [TCP ZeroWindow] 37532 → 7000 [ACK] Seq=218028 Ack=95505 Win=0 Len=0 TSval=4155167444 TSecr=4155167436
 1516   0.033116    127.0.0.1 → 127.0.0.1    TCP 42615 [TCP ZeroWindow] 37532 → 7000 [PSH, ACK] Seq=218028 Ack=95505 Win=0 Len=42549 TSval=4155167448 TSecr=4155167436
 1517   0.037074    127.0.0.1 → 127.0.0.1    TCP 65549 [TCP ZeroWindow] 37532 → 7000 [ACK] Seq=260577 Ack=95505 Win=0 Len=65483 TSval=4155167451 TSecr=4155167436
 1518   0.037093    127.0.0.1 → 127.0.0.1    TCP 66 7000 → 37532 [ACK] Seq=95632 Ack=326060 Win=218240 Len=0 TSval=4155167452 TSecr=4155167448
 
All acks since #1514 are dropped. Thus, we see a retransmission of #1512.
 1519   0.229190    127.0.0.1 → 127.0.0.1    TCP 226 [TCP ZeroWindow] [TCP Spurious Retransmission] 37532 → 7000 [PSH, ACK] Seq=152385 Ack=95505 Win=0 Len=160 TSval=4155167644 TSecr=4155167436
 
The server retransmits its last unacked segment. Since 9ca48d616e
"tcp: do not accept packets beyond window", this segment is dropped as
it extends beyond the window. (Before, this packet passed the sequence
number check and the client sent a window of fresh data on each
retransmission attempt (and only then)).
 1520   0.229201    127.0.0.1 → 127.0.0.1    TCP 193 [TCP Retransmission] 7000 → 37532 [PSH, ACK] Seq=95505 Ack=326060 Win=240896 Len=127 TSval=4155167644 TSecr=4155167448
 1521   0.229206    127.0.0.1 → 127.0.0.1    TCP 78 [TCP Dup ACK 1518#1] 7000 → 37532 [ACK] Seq=95632 Ack=326060 Win=240896 Len=0 TSval=4155167644 TSecr=4155167448 SLE=152385 SRE=152545
 
The connection is effectively stuck: neither ACKs nor retransmissions
from the server are even being looked at.

I tried reverting 8c670bdfa58e ("tcp: correct handling of extreme
memory squeeze") on top of 6.19-rc6. Connection does not hang, but is
broken from a protocol perspective:

0.018805    127.0.0.1 → 127.0.0.1    TCP 226 [TCP ZeroWindow] 34358 → 7000 [PSH, ACK] Seq=151425 Ack=95505 Win=0 Len=160 TSval=602409183 TSecr=602409183
0.024775    127.0.0.1 → 127.0.0.1    TCP 65549 34358 → 7000 [ACK] Seq=151585 Ack=95505 Win=65536 Len=65483 TSval=602409189 TSecr=602409183
0.024800    127.0.0.1 → 127.0.0.1    TCP 193 7000 → 34358 [PSH, ACK] Seq=95632 Ack=217068 Win=97408 Len=127 TSval=602409189 TSecr=602409183

When setting "net.ipv4.tcp_shrink_window=1", packets that cause the
window to shrink (to zero) are accepted (instead of dropping them). 
This helps in this particular scenario, since there is only one
packet in flight.  However, when there are still packets in flight at
the moment the window is closed, those packets are beyond the window
once they arrive (which is correct), but all further packets
sent by the server are regarded as beyond the window as well.

I am not sure what to make out of all of this. It seems that we cannot
always avoid shrinking the receive window (if window scaling is used).
Do we need to track the maximum advertised right edge for
sequence number validation?
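
To make the question concrete, here is a minimal sketch using sequence
numbers from the trace above. It compares validation against the currently
advertised right edge with validation against the largest right edge ever
advertised. The `Receiver` class and its method names are hypothetical and
only illustrate the idea; this is not kernel code, and it says nothing about
whether memory is available to queue an accepted segment.

```python
MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


def seq_before(a, b):
    """True if sequence number a is before b, using wraparound arithmetic."""
    return ((a - b) % MOD) >= (1 << 31)


def seq_after(a, b):
    """True if sequence number a is after b, using wraparound arithmetic."""
    return seq_before(b, a)


class Receiver:
    """Hypothetical receiver state: rcv_nxt, current window, max right edge."""

    def __init__(self, rcv_nxt, window):
        self.rcv_nxt = rcv_nxt
        self.window = window
        self.max_right_edge = (rcv_nxt + window) % MOD

    def advertise(self, window):
        """Record a newly advertised window; the tracked edge never retreats."""
        self.window = window
        edge = (self.rcv_nxt + window) % MOD
        if seq_after(edge, self.max_right_edge):
            self.max_right_edge = edge

    def accepts_current(self, seq, length):
        """Reject segments ending beyond the currently advertised edge."""
        end_seq = (seq + length) % MOD
        return not seq_after(end_seq, (self.rcv_nxt + self.window) % MOD)

    def accepts_tracked(self, seq, length):
        """Reject only segments ending beyond the largest edge ever advertised."""
        end_seq = (seq + length) % MOD
        return not seq_after(end_seq, self.max_right_edge)


if __name__ == "__main__":
    client = Receiver(rcv_nxt=95505, window=65536)  # as advertised in #1510
    client.advertise(0)                             # zero window from #1512
    for name, seq, length in (("retransmit #1520", 95505, 127),
                              ("pure ACK #1514", 95632, 0)):
        print(name, client.accepts_current(seq, length),
              client.accepts_tracked(seq, length))
```

With the current-edge check, both the server's retransmission (Seq=95505
Len=127) and its pure ACKs (Seq=95632) end beyond 95505 and are rejected.
With the tracked edge (95505 + 65536 = 161041) both would pass the sequence
check.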

- Simon

-- 
Simon Baatz <gmbnomis@gmail.com>


Thread overview: 40+ messages
2025-07-11 11:39 [PATCH net-next 0/8] tcp: receiver changes Eric Dumazet
2025-07-11 11:39 ` [PATCH net-next 1/8] tcp: do not accept packets beyond window Eric Dumazet
2025-07-12 20:52   ` Kuniyuki Iwashima
2025-07-15  1:38   ` Jakub Kicinski
2025-07-11 11:40 ` [PATCH net-next 2/8] tcp: add LINUX_MIB_BEYOND_WINDOW Eric Dumazet
2025-07-12 20:55   ` Kuniyuki Iwashima
2025-07-11 11:40 ` [PATCH net-next 3/8] selftests/net: packetdrill: add tcp_rcv_big_endseq.pkt Eric Dumazet
2025-07-12 20:58   ` Kuniyuki Iwashima
2025-07-11 11:40 ` [PATCH net-next 4/8] tcp: call tcp_measure_rcv_mss() for ooo packets Eric Dumazet
2025-07-12 21:11   ` Kuniyuki Iwashima
2025-07-11 11:40 ` [PATCH net-next 5/8] selftests/net: packetdrill: add tcp_ooo_rcv_mss.pkt Eric Dumazet
2025-07-12 21:42   ` Kuniyuki Iwashima
2025-07-11 11:40 ` [PATCH net-next 6/8] tcp: add const to tcp_try_rmem_schedule() and sk_rmem_schedule() skb Eric Dumazet
2025-07-12 21:43   ` Kuniyuki Iwashima
2025-07-11 11:40 ` [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks Eric Dumazet
2025-07-12 21:54   ` Kuniyuki Iwashima
2025-12-15 10:19   ` Christian Ebner
2025-12-18  9:31     ` Christian Ebner
2025-12-18 10:10       ` Eric Dumazet
2025-12-18 12:28         ` Christian Ebner
2025-12-18 13:19           ` Eric Dumazet
2025-12-18 14:58             ` Christian Ebner
2025-12-19  8:23               ` Eric Dumazet
2025-12-19  8:45                 ` Eric Dumazet
2025-12-19 10:00                   ` Christian Ebner
2025-12-19 10:12                     ` Eric Dumazet
2026-01-25 21:11   ` [regression] " Simon Baatz
2025-07-11 11:40 ` [PATCH net-next 8/8] selftests/net: packetdrill: add tcp_rcv_toobig.pkt Eric Dumazet
2025-07-12 21:57   ` Kuniyuki Iwashima
2025-07-15  2:20 ` [PATCH net-next 0/8] tcp: receiver changes patchwork-bot+netdevbpf
2025-07-15  8:25 ` Paolo Abeni
2025-07-15  9:21   ` Matthieu Baerts
2025-07-15 10:14     ` Paolo Abeni
2025-07-15 10:40       ` Matthieu Baerts
2025-07-15 13:28       ` Jakub Kicinski
2025-07-15 13:33         ` Jakub Kicinski
2025-07-15 13:52           ` Paolo Abeni
2025-07-15 14:54             ` Jakub Kicinski
2025-07-15 14:48           ` Kuniyuki Iwashima
2025-07-15 13:50         ` Paolo Abeni