netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
@ 2025-06-06  1:32 Eric Wheeler
  2025-06-06 17:16 ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-06  1:32 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Neal Cardwell,
	Sasha Levin, Yuchung Cheng, stable

Hello Neal,

After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
10GbE via one SFP+ DAC (no bonding), we found TCP performance with
existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
across the switch on 10Gbit ports runs at full 10GbE.

Interestingly, the problem only presents itself when transmitting 
from Linux; receive traffic (to Linux) performs just fine:
	~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
	 ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85

Through bisection, we found this first-bad commit:

	tcp: fix to allow timestamp undo if no retransmits were sent
		upstream: 	e37ab7373696e650d3b6262a5b882aadad69bb9e
		stable 6.6.y:	e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f

To validate the regression, we performed the procedures below using the 
latest versions of Linux. As you can see by comparing the performance 
measurements, it is 10-16x faster after reverting. This appears to affect 
everything after ~6.6.12-rc1 when the patch was introduced, as well as any 
stable releases that cherry-picked it. I have pasted the small commit that 
was reverted below for your reference.

Do you understand why it would behave this way, and what the correct fix 
(or possible workaround) would be? 

Currently we are able to reproduce this reliably, please let me know if 
you would like us to gather any additional information.

-Eric

# Testing v6.6.92

## Before Revert
- git checkout v6.6.92
- build, boot, test with `iperf -c <ip>`
	[  5] local 192.168.1.52 port 42886 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  41.5 MBytes   348 Mbits/sec  153    322 KBytes       
	[  5]   1.00-2.00   sec  3.68 MBytes  30.8 Mbits/sec  491    368 KBytes       
	[  5]   2.00-3.00   sec  3.00 MBytes  25.2 Mbits/sec  1477    425 KBytes       
	[  5]   3.00-4.00   sec  3.25 MBytes  27.2 Mbits/sec  1348   2.85 KBytes       
	[  5]   4.00-5.00   sec  3.43 MBytes  28.8 Mbits/sec  1875    498 KBytes       
	[  5]   5.00-6.00   sec  3.49 MBytes  29.3 Mbits/sec  1957    471 KBytes       
	[  5]   6.00-7.00   sec  2.48 MBytes  20.8 Mbits/sec  1463    538 KBytes       
	[  5]   7.00-8.00   sec  1.25 MBytes  10.5 Mbits/sec  1072    603 KBytes       
	[  5]   8.00-9.00   sec  3.71 MBytes  31.2 Mbits/sec  1362    593 KBytes       
	[  5]   9.00-10.00  sec  2.50 MBytes  21.0 Mbits/sec  1676    624 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  68.3 MBytes  57.3 Mbits/sec  12874             sender <<<
	[  5]   0.00-10.04  sec  64.9 MBytes  54.3 Mbits/sec                  receiver <<<

## After Revert
- git checkout v6.6.92
- git revert e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
- build, boot, test with `iperf -c <ip>`
	[  5] local 192.168.1.52 port 44136 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  90.5 MBytes   759 Mbits/sec  117    261 KBytes       
	[  5]   1.00-2.00   sec   113 MBytes   949 Mbits/sec    5    264 KBytes       
	[  5]   2.00-3.00   sec   113 MBytes   952 Mbits/sec    3    274 KBytes       
	[  5]   3.00-4.00   sec   113 MBytes   947 Mbits/sec    5    267 KBytes       
	[  5]   4.00-5.00   sec   113 MBytes   949 Mbits/sec    3    248 KBytes       
	[  5]   5.00-6.00   sec   113 MBytes   951 Mbits/sec    8    247 KBytes       
	[  5]   6.00-7.00   sec   113 MBytes   947 Mbits/sec    5    252 KBytes       
	[  5]   7.00-8.00   sec   113 MBytes   950 Mbits/sec    6    247 KBytes       
	[  5]   8.00-9.00   sec   113 MBytes   951 Mbits/sec    8    254 KBytes       
	[  5]   9.00-10.00  sec   113 MBytes   948 Mbits/sec    3    247 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  1.08 GBytes   930 Mbits/sec  163               sender <<<
	[  5]   0.00-10.04  sec  1.08 GBytes   925 Mbits/sec                  receiver <<<



# Testing v6.15.1

## Before Revert
- git checkout v6.15.1
- build, boot, test with `iperf -c <ip>`
	[  5] local 192.168.1.52 port 52154 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  77.8 MBytes   652 Mbits/sec   73    298 KBytes       
	[  5]   1.00-2.00   sec  3.61 MBytes  30.3 Mbits/sec  530    388 KBytes       
	[  5]   2.00-3.00   sec  2.88 MBytes  24.2 Mbits/sec  1126    389 KBytes       
	[  5]   3.00-4.00   sec  3.06 MBytes  25.7 Mbits/sec  1750    456 KBytes       
	[  5]   4.00-5.00   sec  3.25 MBytes  27.2 Mbits/sec  1822    488 KBytes       
	[  5]   5.00-6.00   sec  3.43 MBytes  28.8 Mbits/sec  1506    530 KBytes       
	[  5]   6.00-7.00   sec  3.68 MBytes  30.8 Mbits/sec  1926    543 KBytes       
	[  5]   7.00-8.00   sec  2.48 MBytes  20.8 Mbits/sec  1675    609 KBytes       
	[  5]   8.00-9.00   sec  2.49 MBytes  20.9 Mbits/sec  941    332 KBytes       
	[  5]   9.00-10.00  sec  11.1 MBytes  93.4 Mbits/sec  747    358 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec   114 MBytes  95.5 Mbits/sec  12096             sender <<<
	[  5]   0.00-10.04  sec   110 MBytes  92.1 Mbits/sec                  receiver <<<

## After Revert
- git checkout v6.15.1
- git revert e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
- build, boot, test with `iperf -c <ip>`
	[  5] local 192.168.1.52 port 52266 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  91.3 MBytes   766 Mbits/sec   81    275 KBytes       
	[  5]   1.00-2.00   sec   113 MBytes   947 Mbits/sec    6    281 KBytes       
	[  5]   2.00-3.00   sec   113 MBytes   952 Mbits/sec    8    272 KBytes       
	[  5]   3.00-4.00   sec   113 MBytes   950 Mbits/sec    3    274 KBytes       
	[  5]   4.00-5.00   sec   113 MBytes   950 Mbits/sec    6    264 KBytes       
	[  5]   5.00-6.00   sec   113 MBytes   944 Mbits/sec    6    272 KBytes       
	[  5]   6.00-7.00   sec   114 MBytes   952 Mbits/sec    9    272 KBytes       
	[  5]   7.00-8.00   sec  89.0 MBytes   746 Mbits/sec   62    315 KBytes       
	[  5]   8.00-9.00   sec   113 MBytes   947 Mbits/sec    6    304 KBytes       
	[  5]   9.00-10.00  sec   113 MBytes   948 Mbits/sec    6    302 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec  193               sender <<<
	[  5]   0.00-10.04  sec  1.06 GBytes   905 Mbits/sec                  receiver <<<



# git show e37ab7373696e650d3b6262a5b882aadad69bb9e|cat

commit e37ab7373696e650d3b6262a5b882aadad69bb9e
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:15 2024 +0000

    tcp: fix to allow timestamp undo if no retransmits were sent
    
    Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that
    it can trigger undo even if TSQ prevents a fast recovery episode from
    reaching tcp_retransmit_skb().
    
    Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after
    this commit from 2019:
    
    commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo
    on SYN retransmit")
    
    ...and before this fix we could have buggy scenarios like the
    following:
    
    + Due to reordering, a TCP connection receives some SACKs and enters a
      spurious fast recovery.
    
    + TSQ prevents all invocations of tcp_retransmit_skb(), because many
      skbs are queued in lower layers of the sending machine's network
      stack; thus tp->retrans_stamp remains 0.
    
    + The connection receives a TCP timestamp ECR value echoing a
      timestamp before the fast recovery, indicating that the fast
      recovery was spurious.
    
    + The connection fails to undo the spurious fast recovery because
      tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false,
      due to the new logic in the 2019 commit: commit bc9f38c8328e ("tcp:
      avoid unconditional congestion window undo on SYN retransmit")
    
    This fix tweaks the logic to be more similar to the
    tcp_packet_delayed() logic before bc9f38c8328e, except that we take
    care not to be fooled by the FLAG_SYN_ACKED code path zeroing out
    tp->retrans_stamp (the bug noted and fixed by Yuchung in
    bc9f38c8328e).
    
    Note that this returns the high-level behavior of tcp_packet_delayed()
    to again match the comment for the function, which says: "Nothing was
    retransmitted or returned timestamp is less than timestamp of the
    first retransmission." Note that this comment is in the original
    2005-04-16 Linux git commit, so this is evidently long-standing
    behavior.
    
    Fixes: bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit")
    Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cc05ec1faac8..233b77890795 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2473,8 +2473,22 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
  */
 static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
-	return tp->retrans_stamp &&
-	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+	const struct sock *sk = (const struct sock *)tp;
+
+	if (tp->retrans_stamp &&
+	    tcp_tsopt_ecr_before(tp, tp->retrans_stamp))
+		return true;  /* got echoed TS before first retransmission */
+
+	/* Check if nothing was retransmitted (retrans_stamp==0), which may
+	 * happen in fast recovery due to TSQ. But we ignore zero retrans_stamp
+	 * in TCP_SYN_SENT, since when we set FLAG_SYN_ACKED we also clear
+	 * retrans_stamp even if we had retransmitted the SYN.
+	 */
+	if (!tp->retrans_stamp &&	   /* no record of a retransmit/SYN? */
+	    sk->sk_state != TCP_SYN_SENT)  /* not the FLAG_SYN_ACKED case? */
+		return true;  /* nothing was retransmitted */
+
+	return false;
 }
 
 /* Undo procedures. */


--
Eric Wheeler
www.linuxglobal.com

^ permalink raw reply related	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-06-26 20:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-06  1:32 [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent Eric Wheeler
2025-06-06 17:16 ` Neal Cardwell
2025-06-06 22:34   ` Eric Wheeler
2025-06-07 19:13     ` Neal Cardwell
2025-06-07 22:54       ` Neal Cardwell
2025-06-07 23:26         ` Neal Cardwell
2025-06-09 17:45           ` Neal Cardwell
2025-06-10 17:15             ` Neal Cardwell
2025-06-12 18:23               ` Neal Cardwell
2025-06-13 21:02                 ` Neal Cardwell
2025-06-15 20:00               ` Eric Wheeler
2025-06-16 20:13                 ` Eric Wheeler
2025-06-16 21:07                   ` Neal Cardwell
2025-06-18 22:03                     ` Eric Wheeler
2025-06-25 19:17                       ` Eric Wheeler
2025-06-25 20:19                         ` Neal Cardwell
2025-06-25 23:15                           ` Eric Wheeler
2025-06-26 14:21                             ` Neal Cardwell
2025-06-26 20:16                               ` Eric Wheeler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).