netdev.vger.kernel.org archive mirror
* [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
@ 2025-06-06  1:32 Eric Wheeler
  2025-06-06 17:16 ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-06  1:32 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Neal Cardwell,
	Sasha Levin, Yuchung Cheng, stable

Hello Neal,

After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
10GbE via one SFP+ DAC (no bonding), we found that TCP throughput to
existing devices on 1Gbit ports was <60 Mbit/s; however, TCP to devices
across the switch on 10Gbit ports ran at the full 10GbE rate.

Interestingly, the problem only presents itself when transmitting 
from Linux; receive traffic (to Linux) performs just fine:
	~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
	 ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85

Through bisection, we found this first-bad commit:

	tcp: fix to allow timestamp undo if no retransmits were sent
		upstream: 	e37ab7373696e650d3b6262a5b882aadad69bb9e
		stable 6.6.y:	e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f

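For reference, the bisect itself was the usual procedure; roughly (the
good/bad endpoints shown here are illustrative, not the exact tags we
used):

	git bisect start
	git bisect bad v6.6.85      # slow TX to devices on 1GbE ports
	git bisect good v6.6        # known-good baseline
	# at each step: build, boot, run `iperf3 -c <ip>` toward a 1GbE
	# device, then mark `git bisect good` or `git bisect bad` until
	# git prints the first-bad commit
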
To validate the regression, we performed the procedures below using the 
latest versions of Linux. As you can see by comparing the performance 
measurements, it is 10-16x faster after reverting. This appears to affect 
everything after ~v6.12-rc1, when the patch was introduced upstream, as well as any 
stable releases that cherry-picked it. I have pasted the small commit that 
was reverted below for your reference.

Do you understand why it would behave this way, and what the correct fix 
(or possible workaround) would be? 

Currently we are able to reproduce this reliably; please let me know if 
you would like us to gather any additional information.

-Eric

# Testing v6.6.92

## Before Revert
- git checkout v6.6.92
- build, boot, test with `iperf3 -c <ip>`
	[  5] local 192.168.1.52 port 42886 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  41.5 MBytes   348 Mbits/sec  153    322 KBytes       
	[  5]   1.00-2.00   sec  3.68 MBytes  30.8 Mbits/sec  491    368 KBytes       
	[  5]   2.00-3.00   sec  3.00 MBytes  25.2 Mbits/sec  1477    425 KBytes       
	[  5]   3.00-4.00   sec  3.25 MBytes  27.2 Mbits/sec  1348   2.85 KBytes       
	[  5]   4.00-5.00   sec  3.43 MBytes  28.8 Mbits/sec  1875    498 KBytes       
	[  5]   5.00-6.00   sec  3.49 MBytes  29.3 Mbits/sec  1957    471 KBytes       
	[  5]   6.00-7.00   sec  2.48 MBytes  20.8 Mbits/sec  1463    538 KBytes       
	[  5]   7.00-8.00   sec  1.25 MBytes  10.5 Mbits/sec  1072    603 KBytes       
	[  5]   8.00-9.00   sec  3.71 MBytes  31.2 Mbits/sec  1362    593 KBytes       
	[  5]   9.00-10.00  sec  2.50 MBytes  21.0 Mbits/sec  1676    624 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  68.3 MBytes  57.3 Mbits/sec  12874             sender <<<
	[  5]   0.00-10.04  sec  64.9 MBytes  54.3 Mbits/sec                  receiver <<<

## After Revert
- git checkout v6.6.92
- git revert e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
- build, boot, test with `iperf3 -c <ip>`
	[  5] local 192.168.1.52 port 44136 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  90.5 MBytes   759 Mbits/sec  117    261 KBytes       
	[  5]   1.00-2.00   sec   113 MBytes   949 Mbits/sec    5    264 KBytes       
	[  5]   2.00-3.00   sec   113 MBytes   952 Mbits/sec    3    274 KBytes       
	[  5]   3.00-4.00   sec   113 MBytes   947 Mbits/sec    5    267 KBytes       
	[  5]   4.00-5.00   sec   113 MBytes   949 Mbits/sec    3    248 KBytes       
	[  5]   5.00-6.00   sec   113 MBytes   951 Mbits/sec    8    247 KBytes       
	[  5]   6.00-7.00   sec   113 MBytes   947 Mbits/sec    5    252 KBytes       
	[  5]   7.00-8.00   sec   113 MBytes   950 Mbits/sec    6    247 KBytes       
	[  5]   8.00-9.00   sec   113 MBytes   951 Mbits/sec    8    254 KBytes       
	[  5]   9.00-10.00  sec   113 MBytes   948 Mbits/sec    3    247 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  1.08 GBytes   930 Mbits/sec  163               sender <<<
	[  5]   0.00-10.04  sec  1.08 GBytes   925 Mbits/sec                  receiver <<<



# Testing v6.15.1

## Before Revert
- git checkout v6.15.1
- build, boot, test with `iperf3 -c <ip>`
	[  5] local 192.168.1.52 port 52154 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  77.8 MBytes   652 Mbits/sec   73    298 KBytes       
	[  5]   1.00-2.00   sec  3.61 MBytes  30.3 Mbits/sec  530    388 KBytes       
	[  5]   2.00-3.00   sec  2.88 MBytes  24.2 Mbits/sec  1126    389 KBytes       
	[  5]   3.00-4.00   sec  3.06 MBytes  25.7 Mbits/sec  1750    456 KBytes       
	[  5]   4.00-5.00   sec  3.25 MBytes  27.2 Mbits/sec  1822    488 KBytes       
	[  5]   5.00-6.00   sec  3.43 MBytes  28.8 Mbits/sec  1506    530 KBytes       
	[  5]   6.00-7.00   sec  3.68 MBytes  30.8 Mbits/sec  1926    543 KBytes       
	[  5]   7.00-8.00   sec  2.48 MBytes  20.8 Mbits/sec  1675    609 KBytes       
	[  5]   8.00-9.00   sec  2.49 MBytes  20.9 Mbits/sec  941    332 KBytes       
	[  5]   9.00-10.00  sec  11.1 MBytes  93.4 Mbits/sec  747    358 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec   114 MBytes  95.5 Mbits/sec  12096             sender <<<
	[  5]   0.00-10.04  sec   110 MBytes  92.1 Mbits/sec                  receiver <<<

## After Revert
- git checkout v6.15.1
- git revert e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
- build, boot, test with `iperf3 -c <ip>`
	[  5] local 192.168.1.52 port 52266 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec  91.3 MBytes   766 Mbits/sec   81    275 KBytes       
	[  5]   1.00-2.00   sec   113 MBytes   947 Mbits/sec    6    281 KBytes       
	[  5]   2.00-3.00   sec   113 MBytes   952 Mbits/sec    8    272 KBytes       
	[  5]   3.00-4.00   sec   113 MBytes   950 Mbits/sec    3    274 KBytes       
	[  5]   4.00-5.00   sec   113 MBytes   950 Mbits/sec    6    264 KBytes       
	[  5]   5.00-6.00   sec   113 MBytes   944 Mbits/sec    6    272 KBytes       
	[  5]   6.00-7.00   sec   114 MBytes   952 Mbits/sec    9    272 KBytes       
	[  5]   7.00-8.00   sec  89.0 MBytes   746 Mbits/sec   62    315 KBytes       
	[  5]   8.00-9.00   sec   113 MBytes   947 Mbits/sec    6    304 KBytes       
	[  5]   9.00-10.00  sec   113 MBytes   948 Mbits/sec    6    302 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  1.06 GBytes   910 Mbits/sec  193               sender <<<
	[  5]   0.00-10.04  sec  1.06 GBytes   905 Mbits/sec                  receiver <<<



# git show e37ab7373696e650d3b6262a5b882aadad69bb9e|cat

commit e37ab7373696e650d3b6262a5b882aadad69bb9e
Author: Neal Cardwell <ncardwell@google.com>
Date:   Tue Oct 1 20:05:15 2024 +0000

    tcp: fix to allow timestamp undo if no retransmits were sent
    
    Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that
    it can trigger undo even if TSQ prevents a fast recovery episode from
    reaching tcp_retransmit_skb().
    
    Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after
    this commit from 2019:
    
    commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo
    on SYN retransmit")
    
    ...and before this fix we could have buggy scenarios like the
    following:
    
    + Due to reordering, a TCP connection receives some SACKs and enters a
      spurious fast recovery.
    
    + TSQ prevents all invocations of tcp_retransmit_skb(), because many
      skbs are queued in lower layers of the sending machine's network
      stack; thus tp->retrans_stamp remains 0.
    
    + The connection receives a TCP timestamp ECR value echoing a
      timestamp before the fast recovery, indicating that the fast
      recovery was spurious.
    
    + The connection fails to undo the spurious fast recovery because
      tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false,
      due to the new logic in the 2019 commit: commit bc9f38c8328e ("tcp:
      avoid unconditional congestion window undo on SYN retransmit")
    
    This fix tweaks the logic to be more similar to the
    tcp_packet_delayed() logic before bc9f38c8328e, except that we take
    care not to be fooled by the FLAG_SYN_ACKED code path zeroing out
    tp->retrans_stamp (the bug noted and fixed by Yuchung in
    bc9f38c8328e).
    
    Note that this returns the high-level behavior of tcp_packet_delayed()
    to again match the comment for the function, which says: "Nothing was
    retransmitted or returned timestamp is less than timestamp of the
    first retransmission." Note that this comment is in the original
    2005-04-16 Linux git commit, so this is evidently long-standing
    behavior.
    
    Fixes: bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit")
    Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com>
    Signed-off-by: Neal Cardwell <ncardwell@google.com>
    Signed-off-by: Yuchung Cheng <ycheng@google.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cc05ec1faac8..233b77890795 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2473,8 +2473,22 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
  */
 static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
-	return tp->retrans_stamp &&
-	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+	const struct sock *sk = (const struct sock *)tp;
+
+	if (tp->retrans_stamp &&
+	    tcp_tsopt_ecr_before(tp, tp->retrans_stamp))
+		return true;  /* got echoed TS before first retransmission */
+
+	/* Check if nothing was retransmitted (retrans_stamp==0), which may
+	 * happen in fast recovery due to TSQ. But we ignore zero retrans_stamp
+	 * in TCP_SYN_SENT, since when we set FLAG_SYN_ACKED we also clear
+	 * retrans_stamp even if we had retransmitted the SYN.
+	 */
+	if (!tp->retrans_stamp &&	   /* no record of a retransmit/SYN? */
+	    sk->sk_state != TCP_SYN_SENT)  /* not the FLAG_SYN_ACKED case? */
+		return true;  /* nothing was retransmitted */
+
+	return false;
 }
 
 /* Undo procedures. */


--
Eric Wheeler
www.linuxglobal.com


* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-06  1:32 [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent Eric Wheeler
@ 2025-06-06 17:16 ` Neal Cardwell
  2025-06-06 22:34   ` Eric Wheeler
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-06 17:16 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> [...]
>
> Through bisection, we found this first-bad commit:
>
>         tcp: fix to allow timestamp undo if no retransmits were sent
>                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
>                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
>
> [...]
>
> Currently we are able to reproduce this reliably; please let me know if
> you would like us to gather any additional information.

Hi Eric,

Thank you for your detailed report and your offer to run some more tests!

I don't have any good theories yet. It is striking that the apparent
retransmit rate is more than 100x higher in your "Before Revert" case
than in your "After Revert" case. It seems like something very odd is
going on. :-)

If you could re-run tests while gathering more information, and share
that information, that would be very useful.

What would be very useful would be the following information, for both
(a) Before Revert, and (b) After Revert kernels:

# as root, before the test starts, start instrumentation
# and leave it running in the background; something like:
(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000  &
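# (note: ${eth} above is just a placeholder for the test NIC name;
# set it first, e.g.: eth=eth0)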

# then run the test

# then kill the instrumentation loops running in the background:
kill %1 %2 %3

Then if you could copy the iperf output and these output files to a
web server, or Dropbox, or Google Drive, etc, and share the URL, I
would be very grateful.

For this next phase, there's no need to test both 6.6 and 6.15.
Testing either one is fine. We just need, say, 6.15 before the revert,
and 6.15 after the revert.

Thanks!
neal


* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-06 17:16 ` Neal Cardwell
@ 2025-06-06 22:34   ` Eric Wheeler
  2025-06-07 19:13     ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-06 22:34 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 7735 bytes --]

On Fri, 6 Jun 2025, Neal Cardwell wrote:
> On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > [...]
> 
> Thank you for your detailed report and your offer to run some more tests!
> 
> I don't have any good theories yet. It is striking that the apparent
> retransmit rate is more than 100x higher in your "Before Revert" case
> than in your "After Revert" case. It seems like something very odd is
> going on. :-)

good point, I wonder what that might imply...

> If you could re-run tests while gathering more information, and share
> that information, that would be very useful.
> 
> What would be very useful would be the following information, for both
> (a) Before Revert, and (b) After Revert kernels:
> 
> # as root, before the test starts, start instrumentation
> # and leave it running in the background; something like:
> (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000  &
> 
> # then run the test
> 
> # then kill the instrumentation loops running in the background:
> kill %1 %2 %3

Sure, here they are:

	https://www.linuxglobal.com/out/for-neal/

These are the commands that we ran.  You will probably notice that the
NIC is enslaved to a bridge, but the behavior is the same whether or not
it is enslaved to a bridge. (The way these systems are configured, it is
quite difficult to mess with the bridge, so I would like to keep it as
it is for testing if possible.)

# Before Revert

	WHEN=before-revert
	(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
	nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done)  > /tmp/$WHEN-nstat.txt &
	tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
	iperf3 -c 192.168.1.203
	kill %1 %2 %3

	[1] 1769507
	[2] 1769511
	[3] 1769512
	Connecting to host 192.168.1.203, port 5201
	dropped privs to tcpdump
	tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
	[  5] local 192.168.1.51 port 44674 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec   110 MBytes   920 Mbits/sec   28    278 KBytes       
	[  5]   1.00-2.00   sec  3.19 MBytes  26.7 Mbits/sec  260    336 KBytes       
	[  5]   2.00-3.00   sec  3.06 MBytes  25.7 Mbits/sec  862    431 KBytes       
	[  5]   3.00-4.00   sec  3.12 MBytes  26.2 Mbits/sec  1730    462 KBytes       
	[  5]   4.00-5.00   sec  3.25 MBytes  27.2 Mbits/sec  1490    443 KBytes       
	[  5]   5.00-6.00   sec  3.31 MBytes  27.8 Mbits/sec  1898    543 KBytes       
	[  5]   6.00-7.00   sec  2.45 MBytes  20.5 Mbits/sec  1640    111 KBytes       
	[  5]   7.00-8.00   sec  3.70 MBytes  31.0 Mbits/sec  1868    530 KBytes       
	[  5]   8.00-9.00   sec  3.71 MBytes  31.1 Mbits/sec  2137    539 KBytes       
	[  5]   9.00-10.00  sec  3.75 MBytes  31.5 Mbits/sec  1012    365 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec   139 MBytes   117 Mbits/sec  12925             sender
	[  5]   0.00-10.04  sec   137 MBytes   114 Mbits/sec                  receiver

	iperf Done.
	35180 packets captured
	35607 packets received by filter
	0 packets dropped by kernel
	[root@hv ~]# renice -20 $$
	1760056 (process ID) old priority 0, new priority -20
	[1]   Terminated              ( while true; do
	    date +%s.%N; ss -tenmoi; sleep 0.050;
	done ) > /tmp/$WHEN-ss.txt
	[2]-  Terminated              ( while true; do
	    date +%s.%N; nstat; sleep 0.050;
	done ) > /tmp/$WHEN-nstat.txt
	[3]+  Done                    tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203
	[root@hv ~]# tar cvzf $WHEN.tar.gz /tmp/$WHEN*


# After Revert

	eth=br0
	WHEN=after-revert
	(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
	nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done)  > /tmp/$WHEN-nstat.txt &
	tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
	iperf3 -c 192.168.1.203
	kill %1 %2 %3

	[1] 1471593
	[2] 1471597
	[3] 1471598
	Connecting to host 192.168.1.203, port 5201
	dropped privs to tcpdump
	tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
	[  5] local 192.168.1.52 port 41240 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec   113 MBytes   948 Mbits/sec   12    282 KBytes       
	[  5]   1.00-2.00   sec  90.5 MBytes   759 Mbits/sec   73    288 KBytes       
	[  5]   2.00-3.00   sec   114 MBytes   952 Mbits/sec    4    287 KBytes       
	[  5]   3.00-4.00   sec  89.9 MBytes   754 Mbits/sec   56    298 KBytes       
	[  5]   4.00-5.00   sec   113 MBytes   945 Mbits/sec   26    247 KBytes       
	[  5]   5.00-6.00   sec   113 MBytes   946 Mbits/sec    4    261 KBytes       
	[  5]   6.00-7.00   sec   113 MBytes   947 Mbits/sec    8    267 KBytes       
	[  5]   7.00-8.00   sec  89.4 MBytes   750 Mbits/sec   74    318 KBytes       
	[  5]   8.00-9.00   sec  89.7 MBytes   752 Mbits/sec   83    269 KBytes       
	[  5]   9.00-10.00  sec  90.2 MBytes   757 Mbits/sec  110    315 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  1014 MBytes   851 Mbits/sec  450             sender
	[  5]   0.00-10.04  sec  1013 MBytes   846 Mbits/sec                  receiver

	iperf Done.
	131027 packets captured
	131841 packets received by filter
	0 packets dropped by kernel
	[root@hv ~]# tar cvzf after-revert /tmp/before*
	tar: Removing leading `/' from member names
	tar: /tmp/before*: Cannot stat: No such file or directory
	tar: Exiting with failure status due to previous errors
	[1]   Terminated              ( while true; do
	    date +%s.%N; ss -tenmoi; sleep 0.050;
	done ) > /tmp/$WHEN-ss.txt
	[2]-  Terminated              ( while true; do
	    date +%s.%N; nstat; sleep 0.050;
	done ) > /tmp/$WHEN-nstat.txt
	[3]+  Done                    tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203
	[root@hv ~]# tar cvzf $WHEN.tar.gz /tmp/$WHEN*


-Eric

> Then if you could copy the iperf output and these output files to a
> web server, or Dropbox, or Google Drive, etc, and share the URL, I
> would be very grateful.
> 
> For this next phase, there's no need to test both 6.6 and 6.15.
> Testing either one is fine. We just need, say, 6.15 before the revert,
> and 6.15 after the revert.
> 
> Thanks!
> neal
> 


* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-06 22:34   ` Eric Wheeler
@ 2025-06-07 19:13     ` Neal Cardwell
  2025-06-07 22:54       ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-07 19:13 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> [...]
>
> Sure, here they are:
>
>         https://www.linuxglobal.com/out/for-neal/

Hi Eric,

Many thanks for the traces! These traces clearly show the buggy
behavior. The problem is an interaction between the non-SACK behavior
on these connections (due to the non-Linux "device" not supporting
SACK) and the undo logic: for non-SACK connections,
tcp_is_non_sack_preventing_reopen() holds steady in CA_Recovery or
CA_Loss at the end of a loss recovery episode but clears
tp->retrans_stamp to 0. So upon the next ACK, the "tcp: fix to allow
timestamp undo if no retransmits were sent" logic sees
tp->retrans_stamp at 0, erroneously concludes that no data was
retransmitted, and erroneously performs an undo of the cwnd reduction,
restoring cwnd immediately to the value it had before loss recovery.
This causes an immediate build-up of queues and another immediate loss
recovery episode; thus the higher retransmit rate in the buggy
scenario.

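For reference, the helper in question looks roughly like this
(paraphrased from memory of net/ipv4/tcp_input.c; not an exact quote):

	static bool tcp_is_non_sack_preventing_reopen(struct sock *sk)
	{
		struct tcp_sock *tp = tcp_sk(sk);

		if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
			/* Hold old state until something *above* high_seq
			 * is ACKed; RFC 2582 requires this for Reno.
			 */
			if (!tcp_any_retrans_done(sk))
				tp->retrans_stamp = 0;	/* the zeroing at issue */
			return true;
		}
		return false;
	}
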
I will work on a packetdrill reproducer, test a fix, and post a patch
for testing. I think the simplest fix would be to have
tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
(tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
allow tcp_packet_delayed() to return true in that case. That should be
a precise fix for this scenario and does not risk changing behavior
for the much more common case of loss recovery with SACK support.

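In rough form, that extra check might look something like this (an
untested sketch of the idea, not a final patch):

	/* sketch: in tcp_packet_delayed(), before trusting a zero
	 * retrans_stamp to mean "nothing was retransmitted":
	 */
	if (!tp->retrans_stamp &&
	    tcp_is_reno(tp) &&			/* non-SACK connection? */
	    tp->snd_una == tp->high_seq)	/* held at high_seq? */
		return false;	/* retrans_stamp was cleared; cannot undo */
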
Eric, would you be willing to test a simple bug fix patch for this?

Thanks!

neal


* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-07 19:13     ` Neal Cardwell
@ 2025-06-07 22:54       ` Neal Cardwell
  2025-06-07 23:26         ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-07 22:54 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> [...]
>
> I will work on a packetdrill reproducer, test a fix, and post a patch
> for testing. I think the simplest fix would be to have
> tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
> (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
> allow tcp_packet_delayed() to return true in that case. That should be
> a precise fix for this scenario and does not risk changing behavior
> for the much more common case of loss recovery with SACK support.

Indeed, I'm able to reproduce this issue with erroneous undo events on
non-SACK connections at the end of loss recovery with the attached
packetdrill script.

When you run that script on a kernel with the "tcp: fix to allow
timestamp undo if no retransmits were sent" patch, we see:

+ nstat shows an erroneous TcpExtTCPFullUndo event

+ the loss recovery reduces cwnd from the initial 10 to the correct 7
(from CUBIC) but then the erroneous undo event restores the pre-loss
cwnd of 10 and leads to a final cwnd value of 11

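To spot these events during a run, something like this works (a sketch,
using the same tools as earlier in this thread):

	nstat -n    # zero the deltas before the test
	# ... run the packetdrill script or iperf3 test ...
	nstat | egrep 'TCPFullUndo|TCPLossUndo'
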
I will test a patch with the proposed fix and report back.

neal


* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-07 22:54       ` Neal Cardwell
@ 2025-06-07 23:26         ` Neal Cardwell
  2025-06-09 17:45           ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-07 23:26 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 5031 bytes --]

On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> [...]
>
> Indeed, I'm able to reproduce this issue with erroneous undo events on
> non-SACK connections at the end of loss recovery with the attached
> packetdrill script.
>
> When you run that script on a kernel with the "tcp: fix to allow
> timestamp undo if no retransmits were sent" patch, we see:
>
> + nstat shows an erroneous TcpExtTCPFullUndo event
>
> + the loss recovery reduces cwnd from the initial 10 to the correct 7
> (from CUBIC) but then the erroneous undo event restores the pre-loss
> cwnd of 10 and leads to a final cwnd value of 11
>
> I will test a patch with the proposed fix and report back.

Oops, forgot to attach the packetdrill script! Let's try again...

neal

[-- Attachment #2: fr-non-sack-hold-at-high-seq.pkt --]
[-- Type: application/octet-stream, Size: 2362 bytes --]

// Test that in non-SACK fast recovery, we stay in CA_Recovery when
// snd_una == high_seq, and correctly leave when snd_una > high_seq.
// And that this does not cause a spurious undo.

// Set up config.
`../common/defaults.sh`

// Establish a connection.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

  +0 `nstat -n`

   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
+.020 < . 1:1(0) ack 1 win 320
   +0 accept(3, ..., ...) = 4

// Write some data, and send the initial congestion window.
   +0 write(4, ..., 15000) = 15000
   +0 > P. 1:10001(10000) ack 1

// Limited transmit: on first dupack (for pkt 2), send a new data segment.
+.020 < . 1:1(0) ack 1 win 320
   +0 > . 10001:11001(1000) ack 1
  +0 %{ print(tcpi_snd_cwnd) }%
  +0 %{ print(tcpi_sacked) }%

// Limited transmit: on second dupack (for pkt 3), send a new data segment.
+.002 < . 1:1(0) ack 1 win 320
   +0 > . 11001:12001(1000) ack 1
  +0 %{ assert tcpi_snd_cwnd == 10,  tcpi_snd_cwnd }%
  +0 %{ print(tcpi_sacked) }%


// On third dupack (for pkt 4), enter fast recovery.
   +0 < . 1:1(0) ack 1 win 320
   +0 > . 1:1001(1000) ack 1
   +0 %{ assert tcpi_ca_state == TCP_CA_Recovery, tcpi_ca_state }%

// Receive dupack for pkt 5:
+.002 < . 1:1(0) ack 1 win 320

// Receive dupack for pkt 6:
+.002 < . 1:1(0) ack 1 win 320

// Receive dupack for pkt 7:
+.002 < . 1:1(0) ack 1 win 320

// Receive dupack for pkt 8:
+.002 < . 1:1(0) ack 1 win 320

// Receive dupack for pkt 9:
+.002 < . 1:1(0) ack 1 win 320

// Receive dupack for pkt 10:
+.002 < . 1:1(0) ack 1 win 320

// Receive dupack for limited transmit of pkt 11:
+.002 < . 1:1(0) ack 1 win 320

// Receive dupack for limited transmit of pkt 12:
+.002 < . 1:1(0) ack 1 win 320

// Receive cumulative ACK for fast retransmit that plugged the sequence hole.
// Because this is a non-SACK connection and snd_una == high_seq,
// we stay in CA_Recovery.
+.020 < . 1:1(0) ack 12001 win 320
   +0 %{ assert tcpi_ca_state == TCP_CA_Recovery, tcpi_ca_state }%
   +0 %{ print(tcpi_snd_cwnd) }%

// Receive ACK advancing snd_una past high_seq
+.002 < . 1:1(0) ack 13001 win 320
   +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
   +0 %{ print(tcpi_snd_cwnd) }%
   +0 `nstat`
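// To run (a sketch; assumes the google/packetdrill tool, with
// ../common/defaults.sh provided by its test suite layout):
//   packetdrill fr-non-sack-hold-at-high-seq.pkt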


* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-07 23:26         ` Neal Cardwell
@ 2025-06-09 17:45           ` Neal Cardwell
  2025-06-10 17:15             ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-09 17:45 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 5662 bytes --]

On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> > [...]
> >
> > Indeed, I'm able to reproduce this issue with erroneous undo events on
> > non-SACK connections at the end of loss recovery with the attached
> > packetdrill script.
> >
> > When you run that script on a kernel with the "tcp: fix to allow
> > timestamp undo if no retransmits were sent" patch, we see:
> >
> > + nstat shows an erroneous TcpExtTCPFullUndo event
> >
> > + the loss recovery reduces cwnd from the initial 10 to the correct 7
> > (from CUBIC) but then the erroneous undo event restores the pre-loss
> > cwnd of 10 and leads to a final cwnd value of 11
> >
> > I will test a patch with the proposed fix and report back.

And the attached packetdrill script (which "passes" on a kernel with
"tcp: fix to allow timestamp undo if no retransmits were sent") shows
that a similar erroneous undo (TcpExtTCPLossUndo in this case) happens
at the end of RTO recovery (CA_Loss) if there is an ACK that makes
snd_una exactly equal to high_seq. This is expected, given that both
fast recovery and RTO recovery use tcp_is_non_sack_preventing_reopen().

neal

[-- Attachment #2: frto-real-timeout-nonsack-hold-at-high-seq.pkt --]
[-- Type: application/octet-stream, Size: 2028 bytes --]

// Test F-RTO on a real timeout without SACK.
// Identical to frto-real-timeout-nonsack.pkt
// except that there is an ACK that advances snd_una
// to exactly equal high_seq, to test this tricky case.

// Set up config.
`../common/defaults.sh`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 `nstat -n`

   +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
 +.02 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4
   +0 write(4, ..., 15000) = 15000
   +0 > P. 1:10001(10000) ack 1

// RTO and retransmit head
 +.22 > . 1:1001(1000) ack 1

// F-RTO probes
 +.01 < . 1:1(0) ack 1001 win 257
   +0 > P. 10001:12001(2000) ack 1

// The probes are acked so the timeout is real.
 +.05 < . 1:1(0) ack 1001 win 257
   +0 > . 1001:2001(1000) ack 1
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 2, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
+.002 < . 1:1(0) ack 1001 win 257
   +0 > . 2001:3001(1000) ack 1

 +.05 < . 1:1(0) ack 3001 win 257
   +0 > . 3001:4001(1000) ack 1

 +.05 < . 1:1(0) ack 4001 win 257
   +0 > P. 4001:6001(2000) ack 1

 +.05 < . 1:1(0) ack 6001 win 257
   +0 > P. 6001:10001(4000) ack 1

// Receive ack and advance snd_una to match high_seq.
 +.05 < . 1:1(0) ack 10001 win 257
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 7, tcpi_snd_cwnd }%
   +0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%

// Receive ack and advance snd_una beyond high_seq.
   +0 < . 1:1(0) ack 12001 win 257
   +0 > P. 12001:15001(3000) ack 1
   +0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 12, tcpi_snd_cwnd }%

 +.05 < . 1:1(0) ack 15001 win 257
   +0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
   +0 %{ assert tcpi_snd_cwnd == 15, tcpi_snd_cwnd }%

   +0 `nstat`


* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-09 17:45           ` Neal Cardwell
@ 2025-06-10 17:15             ` Neal Cardwell
  2025-06-12 18:23               ` Neal Cardwell
  2025-06-15 20:00               ` Eric Wheeler
  0 siblings, 2 replies; 19+ messages in thread
From: Neal Cardwell @ 2025-06-10 17:15 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 2522 bytes --]

On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> [...]

Hi Eric,

Do you have cycles to test a proposed fix patch developed by our team?

The attached patch should apply (with "git am") to any recent kernel
that has the "tcp: fix to allow timestamp undo if no retransmits were
sent" patch it is fixing. So you should be able to test it on top of
the 6.6 stable or 6.15 stable kernels you used earlier, whichever is
easier.

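Something like the following should work (a sketch; the patch file name
is the one attached to this message):

	git checkout v6.6.92   # or v6.15.1
	git am 0001-tcp-fix-tcp_packet_delayed-for-tcp_is_non_sack_preve.patch
	# then build, boot, and rerun the iperf3 test
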
If you have cycles to rerun your iperf test, with tcpdump, nstat, and
ss instrumentation, that would be fantastic!

The patch passes our internal packetdrill test suite, including new
tests for this issue (based on the packetdrill scripts posted earlier
in this thread).

But it would be fantastic to directly confirm that this fixes your issue.

Thanks!
neal

[-- Attachment #2: 0001-tcp-fix-tcp_packet_delayed-for-tcp_is_non_sack_preve.patch --]
[-- Type: application/octet-stream, Size: 4511 bytes --]

From 3cc3efcc051b27ced412c7d02e7784da8810751c Mon Sep 17 00:00:00 2001
From: Neal Cardwell <ncardwell@google.com>
Date: Sat, 7 Jun 2025 23:08:26 +0000
Subject: [PATCH] tcp: fix tcp_packet_delayed() for
 tcp_is_non_sack_preventing_reopen() behavior

After the following commit from 2024:

commit e37ab7373696 ("tcp: fix to allow timestamp undo if no retransmits were sent")

...there was buggy behavior where TCP connections without SACK support
could easily see erroneous undo events at the end of fast recovery or
RTO recovery episodes. The erroneous undo events could cause those
connections to be suffer repeated loss recovery episodes and high
retransmit rates.

The problem was an interaction between the non-SACK behavior on these
connections and the undo logic: for non-SACK
connections at the end of a loss recovery episode, if snd_una ==
high_seq, then tcp_is_non_sack_preventing_reopen() holds steady in
CA_Recovery or CA_Loss, but clears tp->retrans_stamp to 0. Then upon
the next ACK the "tcp: fix to allow timestamp undo if no retransmits
were sent" logic saw the tp->retrans_stamp at 0 and erroneously
concluded that no data was retransmitted, and erroneously performed an
undo of the cwnd reduction, restoring cwnd immediately to the value it
had before loss recovery.  This caused an immediate burst of traffic
and build-up of queues and likely another immediate loss recovery
episode.

This commit fixes tcp_packet_delayed() to ignore zero retrans_stamp
values for non-SACK connections when snd_una is at or above high_seq,
because tcp_is_non_sack_preventing_reopen() clears retrans_stamp in
this case, so it's not a valid signal that we can undo.

Note that the commit named in the Fixes footer restored long-present
behavior from roughly 2005-2019, so apparently this bug was present
for a while during that era, and this was simply not caught.

Fixes: e37ab7373696 ("tcp: fix to allow timestamp undo if no retransmits were sent")
Reported-by: Eric Wheeler <netdev@lists.ewheeler.net>
Closes: https://lore.kernel.org/netdev/64ea9333-e7f9-0df-b0f2-8d566143acab@ewheeler.net/
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 net/ipv4/tcp_input.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a35018e2d0ba2..55e86275c823d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2484,20 +2484,33 @@ static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
 	const struct sock *sk = (const struct sock *)tp;
 
-	if (tp->retrans_stamp &&
-	    tcp_tsopt_ecr_before(tp, tp->retrans_stamp))
-		return true;  /* got echoed TS before first retransmission */
-
-	/* Check if nothing was retransmitted (retrans_stamp==0), which may
-	 * happen in fast recovery due to TSQ. But we ignore zero retrans_stamp
-	 * in TCP_SYN_SENT, since when we set FLAG_SYN_ACKED we also clear
-	 * retrans_stamp even if we had retransmitted the SYN.
+	/* Received an echoed timestamp before the first retransmission? */
+	if (tp->retrans_stamp)
+		return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+
+	/* We set tp->retrans_stamp upon the first retransmission of a loss
+	 * recovery episode, so normally if tp->retrans_stamp is 0 then no
+	 * retransmission has happened yet (likely due to TSQ, which can cause
+	 * fast retransmits to be delayed). So if snd_una advanced while
+	 * tp->retrans_stamp was 0, then apparently a packet was merely delayed,
+	 * not lost. But there are exceptions where we retransmit but then
+	 * clear tp->retrans_stamp, so we check for those exceptions.
 	 */
-	if (!tp->retrans_stamp &&	   /* no record of a retransmit/SYN? */
-	    sk->sk_state != TCP_SYN_SENT)  /* not the FLAG_SYN_ACKED case? */
-		return true;  /* nothing was retransmitted */
 
-	return false;
+	/* (1) For non-SACK connections, tcp_is_non_sack_preventing_reopen()
+	 * clears tp->retrans_stamp when snd_una == high_seq.
+	 */
+	if (!tcp_is_sack(tp) && !before(tp->snd_una, tp->high_seq))
+		return false;
+
+	/* (2) In TCP_SYN_SENT tcp_clean_rtx_queue() clears tp->retrans_stamp
+	 * when FLAG_SYN_ACKED is set, even if the SYN was
+	 * retransmitted.
+	 */
+	if (sk->sk_state == TCP_SYN_SENT)
+		return false;
+
+	return true;	/* tp->retrans_stamp is zero; no retransmit yet */
 }
 
 /* Undo procedures. */
-- 
2.50.0.rc0.642.g800a2b2222-goog



* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-10 17:15             ` Neal Cardwell
@ 2025-06-12 18:23               ` Neal Cardwell
  2025-06-13 21:02                 ` Neal Cardwell
  2025-06-15 20:00               ` Eric Wheeler
  1 sibling, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-12 18:23 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Tue, Jun 10, 2025 at 1:15 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > >
> > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > >
> > > > > > > > Hello Neal,
> > > > > > > >
> > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > >
> > > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > > > > > > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > > >
> > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > >
> > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
>
> Hi Eric,
>
> Do you have cycles to test a proposed fix patch developed by our team?
>
> The attached patch should apply (with "git am") for any recent kernel
> that has the "tcp: fix to allow timestamp undo if no retransmits were
> sent" patch it is fixing. So you should be able to test it on top of
> the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> easier.
>
> If you have cycles to rerun your iperf test, with  tcpdump, nstat, and
> ss instrumentation, that would be fantastic!
>
> The patch passes our internal packetdrill test suite, including new
> tests for this issue (based on the packetdrill scripts posted earlier
> in this thread).
>
> But it would be fantastic to directly confirm that this fixes your issue.

Hi Eric (Wheeler),

Just checking: would you be able to test that patch (from my previous
message) in your environment?

If not, given that the patch fixes our packetdrill reproducers, we can
send the patch to the list as-is without that testing.

Thanks,
neal

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-12 18:23               ` Neal Cardwell
@ 2025-06-13 21:02                 ` Neal Cardwell
  0 siblings, 0 replies; 19+ messages in thread
From: Neal Cardwell @ 2025-06-13 21:02 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Thu, Jun 12, 2025 at 2:23 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Tue, Jun 10, 2025 at 1:15 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > >
> > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > >
> > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > >
> > > > > > > > > Hello Neal,
> > > > > > > > >
> > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > >
> > > > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > > > > > > > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > > > >
> > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > >
> > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> >
> > Hi Eric,
> >
> > Do you have cycles to test a proposed fix patch developed by our team?
> >
> > The attached patch should apply (with "git am") for any recent kernel
> > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > sent" patch it is fixing. So you should be able to test it on top of
> > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > easier.
> >
> > If you have cycles to rerun your iperf test, with  tcpdump, nstat, and
> > ss instrumentation, that would be fantastic!
> >
> > The patch passes our internal packetdrill test suite, including new
> > tests for this issue (based on the packetdrill scripts posted earlier
> > in this thread).
> >
> > But it would be fantastic to directly confirm that this fixes your issue.
>
> Hi Eric (Wheeler),
>
> Just checking: would you be able to test that patch (from my previous
> message) in your environment?
>
> If not, given that the patch fixes our packetdrill reproducers, we can
> send the patch to the list as-is without that testing.

An update on this thread: I sent the patch to the netdev list:

[PATCH net] tcp: fix tcp_packet_delayed() for
tcp_is_non_sack_preventing_reopen() behavior

https://lore.kernel.org/netdev/20250613193056.1585351-1-ncardwell.sw@gmail.com/

best,
neal

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-10 17:15             ` Neal Cardwell
  2025-06-12 18:23               ` Neal Cardwell
@ 2025-06-15 20:00               ` Eric Wheeler
  2025-06-16 20:13                 ` Eric Wheeler
  1 sibling, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-15 20:00 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 3014 bytes --]

On Tue, 10 Jun 2025, Neal Cardwell wrote:
> On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > >
> > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > >
> > > > > > > > Hello Neal,
> > > > > > > >
> > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > >
> > > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > > > > > > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > > >
> > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > >
> > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> 
> Hi Eric,
> 
> Do you have cycles to test a proposed fix patch developed by our team?

Sorry for the radio silence; I just got back into town, so I can do that
later this week.

> The attached patch should apply (with "git am") for any recent kernel
> that has the "tcp: fix to allow timestamp undo if no retransmits were
> sent" patch it is fixing. So you should be able to test it on top of
> the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> easier.

I can test it on top of 6.6-stable, but I have to put a production system
into standby in order to do that, so I will report back when I can,
possibly as late as Friday 6/20, since the office is closed that day and
I can work on it then.
 
> If you have cycles to rerun your iperf test, with  tcpdump, nstat, and
> ss instrumentation, that would be fantastic!

will do 
 
> The patch passes our internal packetdrill test suite, including new
> tests for this issue (based on the packetdrill scripts posted earlier
> in this thread).

Awesome, thank you for all of the effort to fix this!

-Eric

> 
> But it would be fantastic to directly confirm that this fixes your issue.
> 
> Thanks!
> neal
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-15 20:00               ` Eric Wheeler
@ 2025-06-16 20:13                 ` Eric Wheeler
  2025-06-16 21:07                   ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-16 20:13 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 6758 bytes --]

On Sun, 15 Jun 2025, Eric Wheeler wrote:
> On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > >
> > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > >
> > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > >
> > > > > > > > > Hello Neal,
> > > > > > > > >
> > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > >
> > > > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > > > > > > > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > > > >
> > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > >
> > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > 
> 
> > The attached patch should apply (with "git am") for any recent kernel
> > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > sent" patch it is fixing. So you should be able to test it on top of
> > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > easier.

Definitely better, but performance is ~15% slower than with the revert,
and the retransmit counts are still higher as well.  In the two sections
below you can see the difference between after the fix and after the
revert.

Here is the output:

## After fixing with your patch:
	https://www.linuxglobal.com/out/for-neal/after-fix.tar.gz

	WHEN=after-fix
	(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
	nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done)  > /tmp/$WHEN-nstat.txt &
	tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
	iperf3 -c 192.168.1.203
	kill %1 %2 %3

	[1] 2300
	nstat: history is aged out, resetting
	[2] 2304
	[3] 2305
	Connecting to host 192.168.1.203, port 5201
	[  5] local 192.168.1.52 port 47730 connected to 192.168.1.203 port 5201
	dropped privs to tcpdump
	tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec   115 MBytes   963 Mbits/sec   21    334 KBytes       
	[  5]   1.00-2.00   sec   113 MBytes   949 Mbits/sec    3    325 KBytes       
	[  5]   2.00-3.00   sec  41.8 MBytes   350 Mbits/sec  216   5.70 KBytes       
	[  5]   3.00-4.00   sec   113 MBytes   952 Mbits/sec   77    234 KBytes       
	[  5]   4.00-5.00   sec   110 MBytes   927 Mbits/sec    5    281 KBytes       
	[  5]   5.00-6.00   sec  69.5 MBytes   583 Mbits/sec  129    336 KBytes       
	[  5]   6.00-7.00   sec  66.8 MBytes   561 Mbits/sec  234    302 KBytes       
	[  5]   7.00-8.00   sec   113 MBytes   949 Mbits/sec    8    312 KBytes       
	[  5]   8.00-9.00   sec  89.9 MBytes   754 Mbits/sec   72    247 KBytes       
	[  5]   9.00-10.00  sec   113 MBytes   949 Mbits/sec    6    235 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec   946 MBytes   794 Mbits/sec  771               sender <<<
	[  5]   0.00-10.04  sec   944 MBytes   789 Mbits/sec                  receiver <<<

	iperf Done.
	145337 packets captured
	146674 packets received by filter
	0 packets dropped by kernel
	[1]   Terminated              ( while true; do
	    date +%s.%N; ss -tenmoi; sleep 0.050;
	done ) > /tmp/$WHEN-ss.txt
	[root@hv2 ~]# 
	[2]-  Terminated              ( while true; do
	    date +%s.%N; nstat; sleep 0.050;
	done ) > /tmp/$WHEN-nstat.txt
	[3]+  Done                    tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203

## After Revert
	WHEN=after-revert-6.6.93
	(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
	nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done)  > /tmp/$WHEN-nstat.txt &
	tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
	iperf3 -c 192.168.1.203
	kill %1 %2 %3
	[1] 2088
	nstat: history is aged out, resetting
	[2] 2092
	[3] 2093
	Connecting to host 192.168.1.203, port 5201
	dropped privs to tcpdump
	tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
	[  5] local 192.168.1.52 port 47256 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec   115 MBytes   962 Mbits/sec   13    324 KBytes       
	[  5]   1.00-2.00   sec   114 MBytes   953 Mbits/sec    3    325 KBytes       
	[  5]   2.00-3.00   sec   113 MBytes   947 Mbits/sec    4    321 KBytes       
	[  5]   3.00-4.00   sec   113 MBytes   950 Mbits/sec    3    321 KBytes       
	[  5]   4.00-5.00   sec   113 MBytes   946 Mbits/sec    5    322 KBytes       
	[  5]   5.00-6.00   sec   113 MBytes   950 Mbits/sec    8    321 KBytes       
	[  5]   6.00-7.00   sec   113 MBytes   948 Mbits/sec    5    312 KBytes       
	[  5]   7.00-8.00   sec   113 MBytes   952 Mbits/sec    3    301 KBytes       
	[  5]   8.00-9.00   sec   113 MBytes   945 Mbits/sec    7    301 KBytes       
	[  5]   9.00-10.00  sec   114 MBytes   953 Mbits/sec    4    302 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  1.11 GBytes   950 Mbits/sec   55             sender
	[  5]   0.00-10.04  sec  1.10 GBytes   945 Mbits/sec                  receiver

	iperf Done.
	[root@hv2 ~]# 189249 packets captured
	189450 packets received by filter
	0 packets dropped by kernel


--
Eric Wheeler

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-16 20:13                 ` Eric Wheeler
@ 2025-06-16 21:07                   ` Neal Cardwell
  2025-06-18 22:03                     ` Eric Wheeler
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-16 21:07 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > >
> > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > >
> > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > >
> > > > > > > > > > Hello Neal,
> > > > > > > > > >
> > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > >
> > > > > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > > > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > > > > > > > > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > > > > >
> > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > >
> > > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > >
> >
> > > The attached patch should apply (with "git am") for any recent kernel
> > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > sent" patch it is fixing. So you should be able to test it on top of
> > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > easier.
>
> Definitely better, but performance is ~15% slower vs reverting, and the
> retransmit counts are still higher than the other.  In the two sections
> below you can see the difference between after the fix and after the
> revert.
>
> Here is the output:
>
> ## After fixing with your patch:
>         https://www.linuxglobal.com/out/for-neal/after-fix.tar.gz
>
>         WHEN=after-fix
>         (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
>         nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done)  > /tmp/$WHEN-nstat.txt &
>         tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
>         iperf3 -c 192.168.1.203
>         kill %1 %2 %3
>
>         [1] 2300
>         nstat: history is aged out, resetting
>         [2] 2304
>         [3] 2305
>         Connecting to host 192.168.1.203, port 5201
>         [  5] local 192.168.1.52 port 47730 connected to 192.168.1.203 port 5201
>         dropped privs to tcpdump
>         tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
>         [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>         [  5]   0.00-1.00   sec   115 MBytes   963 Mbits/sec   21    334 KBytes
>         [  5]   1.00-2.00   sec   113 MBytes   949 Mbits/sec    3    325 KBytes
>         [  5]   2.00-3.00   sec  41.8 MBytes   350 Mbits/sec  216   5.70 KBytes
>         [  5]   3.00-4.00   sec   113 MBytes   952 Mbits/sec   77    234 KBytes
>         [  5]   4.00-5.00   sec   110 MBytes   927 Mbits/sec    5    281 KBytes
>         [  5]   5.00-6.00   sec  69.5 MBytes   583 Mbits/sec  129    336 KBytes
>         [  5]   6.00-7.00   sec  66.8 MBytes   561 Mbits/sec  234    302 KBytes
>         [  5]   7.00-8.00   sec   113 MBytes   949 Mbits/sec    8    312 KBytes
>         [  5]   8.00-9.00   sec  89.9 MBytes   754 Mbits/sec   72    247 KBytes
>         [  5]   9.00-10.00  sec   113 MBytes   949 Mbits/sec    6    235 KBytes
>         - - - - - - - - - - - - - - - - - - - - - - - - -
>         [ ID] Interval           Transfer     Bitrate         Retr
>         [  5]   0.00-10.00  sec   946 MBytes   794 Mbits/sec  771               sender <<<
>         [  5]   0.00-10.04  sec   944 MBytes   789 Mbits/sec                  receiver <<<
>
>         iperf Done.
>         145337 packets captured
>         146674 packets received by filter
>         0 packets dropped by kernel
>         [1]   Terminated              ( while true; do
>             date +%s.%N; ss -tenmoi; sleep 0.050;
>         done ) > /tmp/$WHEN-ss.txt
>         [root@hv2 ~]#
>         [2]-  Terminated              ( while true; do
>             date +%s.%N; nstat; sleep 0.050;
>         done ) > /tmp/$WHEN-nstat.txt
>         [3]+  Done                    tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203
>
> ## After Revert
>         WHEN=after-revert-6.6.93
>         (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
>         nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done)  > /tmp/$WHEN-nstat.txt &
>         tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
>         iperf3 -c 192.168.1.203
>         kill %1 %2 %3
>         [1] 2088
>         nstat: history is aged out, resetting
>         [2] 2092
>         [3] 2093
>         Connecting to host 192.168.1.203, port 5201
>         dropped privs to tcpdump
>         tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
>         [  5] local 192.168.1.52 port 47256 connected to 192.168.1.203 port 5201
>         [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
>         [  5]   0.00-1.00   sec   115 MBytes   962 Mbits/sec   13    324 KBytes
>         [  5]   1.00-2.00   sec   114 MBytes   953 Mbits/sec    3    325 KBytes
>         [  5]   2.00-3.00   sec   113 MBytes   947 Mbits/sec    4    321 KBytes
>         [  5]   3.00-4.00   sec   113 MBytes   950 Mbits/sec    3    321 KBytes
>         [  5]   4.00-5.00   sec   113 MBytes   946 Mbits/sec    5    322 KBytes
>         [  5]   5.00-6.00   sec   113 MBytes   950 Mbits/sec    8    321 KBytes
>         [  5]   6.00-7.00   sec   113 MBytes   948 Mbits/sec    5    312 KBytes
>         [  5]   7.00-8.00   sec   113 MBytes   952 Mbits/sec    3    301 KBytes
>         [  5]   8.00-9.00   sec   113 MBytes   945 Mbits/sec    7    301 KBytes
>         [  5]   9.00-10.00  sec   114 MBytes   953 Mbits/sec    4    302 KBytes
>         - - - - - - - - - - - - - - - - - - - - - - - - -
>         [ ID] Interval           Transfer     Bitrate         Retr
>         [  5]   0.00-10.00  sec  1.11 GBytes   950 Mbits/sec   55             sender
>         [  5]   0.00-10.04  sec  1.10 GBytes   945 Mbits/sec                  receiver
>
>         iperf Done.
>         [root@hv2 ~]# 189249 packets captured
>         189450 packets received by filter
>         0 packets dropped by kernel

Thanks for the test data!

Looking at the traces, there are no undo events, and no spurious loss
recovery events that I can see. So I don't see how the fix patch,
which changes undo behavior, would be relevant to the performance in
the test. It looks to me like the "after-fix" test just got unlucky
with packet losses, and because the receiver does not have SACK
support, any bad luck can easily turn into very poor performance, with
200ms timeouts during fast recovery.

Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
cases multiple times, so we can get a sense of what is signal and what
is noise? Perhaps 20 or 50 trials for each approach?

Thanks!
neal

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-16 21:07                   ` Neal Cardwell
@ 2025-06-18 22:03                     ` Eric Wheeler
  2025-06-25 19:17                       ` Eric Wheeler
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-18 22:03 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 4954 bytes --]

On Mon, 16 Jun 2025, Neal Cardwell wrote:

> On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> >
> > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > >
> > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > >
> > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hello Neal,
> > > > > > > > > > >
> > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > >
> > > > > > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > > > > >         ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE  -> device
> > > > > > > > > > >          ~1Gbit: device        =TX=>  1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > > > > > >
> > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > >
> > > > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > >
> > >
> > > > The attached patch should apply (with "git am") for any recent kernel
> > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > easier.
> >
> > Definitely better, but performance is ~15% slower vs reverting, and the
> > retransmit counts are still higher than the other.  In the two sections
> > below you can see the difference between after the fix and after the
> > revert.
> >
> > Here is the output:
> >
> > ## After fixing with your patch:
> >         - - - - - - - - - - - - - - - - - - - - - - - - -
> >         [ ID] Interval           Transfer     Bitrate         Retr
> >         [  5]   0.00-10.00  sec   946 MBytes   794 Mbits/sec  771               sender <<<
> >         [  5]   0.00-10.04  sec   944 MBytes   789 Mbits/sec                  receiver <<<
> >
> > ## After Revert
> >         - - - - - - - - - - - - - - - - - - - - - - - - -
> >         [ ID] Interval           Transfer     Bitrate         Retr
> >         [  5]   0.00-10.00  sec  1.11 GBytes   950 Mbits/sec   55             sender
> >         [  5]   0.00-10.04  sec  1.10 GBytes   945 Mbits/sec                  receiver
> 
> Thanks for the test data!
> 
> Looking at the traces, there are no undo events, and no spurious loss
> recovery events that I can see. So I don't see how the fix patch,
> which changes undo behavior, would be relevant to the performance in
> the test. It looks to me like the "after-fix" test just got unlucky
> with packet losses, and because the receiver does not have SACK
> support, any bad luck can easily turn into very poor performance, with
> 200ms timeouts during fast recovery.
> 
> Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> cases multiple times, so we can get a sense of what is signal and what
> is noise? Perhaps 20 or 50 trials for each approach?
 
I ran 50 tests after the revert and compared them to 50 tests after the
fix, using both the arithmetic and geometric mean, and the fix still
appears to be slightly slower than with the revert alone:

	# after-revert-6.6.93    
	Arithmetic Mean: 843.64 Mbits/sec
	Geometric Mean: 841.95 Mbits/sec

	# after-tcp-fix-6.6.93    
	Arithmetic Mean: 823.00 Mbits/sec
	Geometric Mean: 819.38 Mbits/sec
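
For reference, one way to compute these two summary statistics from the
per-run results (a minimal sketch, assuming the sender bitrates have been
extracted to one Mbits/sec value per line; the filename is illustrative):

	# bitrates.txt: one sender bitrate (Mbits/sec) per line, one per trial
	awk '{ s += $1; l += log($1); n++ }
	     END { printf "Arithmetic Mean: %.2f Mbits/sec\n", s / n
	           printf "Geometric Mean: %.2f Mbits/sec\n", exp(l / n) }' bitrates.txt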

Do you think that this is an actual performance regression, or just a
sample set that is not big enough to work out the averages?

Here is the data collected for each of the 50 tests:
	- https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
	- https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz


--
Eric Wheeler


> Thanks!
> neal
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-18 22:03                     ` Eric Wheeler
@ 2025-06-25 19:17                       ` Eric Wheeler
  2025-06-25 20:19                         ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-25 19:17 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 3390 bytes --]

On Wed, 18 Jun 2025, Eric Wheeler wrote:
> On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > >
> > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > >
> > > > > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > >
> > > >
> > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > easier.
> > >
> > > Definitely better, but performance is ~15% slower vs reverting, and the
> > > retransmit counts are still higher than the other.  In the two sections
> > > below you can see the difference between after the fix and after the
> > > revert.
> > >
> >
> > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > cases multiple times, so we can get a sense of what is signal and what
> > is noise? Perhaps 20 or 50 trials for each approach?
>  
> I ran 50 tests after revert and compare that to after the fix using both
> average and geometric mean, and it still appears to be slightly slower
> then with the revert alone:
> 
> 	# after-revert-6.6.93    
> 	Arithmetic Mean: 843.64 Mbits/sec
> 	Geometric Mean: 841.95 Mbits/sec
> 
> 	# after-tcp-fix-6.6.93    
> 	Arithmetic Mean: 823.00 Mbits/sec
> 	Geometric Mean: 819.38 Mbits/sec
> 

Re-sending this question in case this message got lost:

> Do you think that this is an actual performance regression, or just a
> sample set that is not big enough to work out the averages?
> 
> Here is the data collected for each of the 50 tests:
> 	- https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> 	- https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz



--
Eric Wheeler

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-25 19:17                       ` Eric Wheeler
@ 2025-06-25 20:19                         ` Neal Cardwell
  2025-06-25 23:15                           ` Eric Wheeler
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-25 20:19 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> On Wed, 18 Jun 2025, Eric Wheeler wrote:
> > On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > > >
> > > > > > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > >
> > > > >
> > > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > > easier.
> > > >
> > > > Definitely better, but performance is ~15% slower vs reverting, and the
> > > > retransmit counts are still higher than the other.  In the two sections
> > > > below you can see the difference between after the fix and after the
> > > > revert.
> > > >
> > >
> > > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > > cases multiple times, so we can get a sense of what is signal and what
> > > is noise? Perhaps 20 or 50 trials for each approach?
> >
> > I ran 50 tests after revert and compare that to after the fix using both
> > average and geometric mean, and it still appears to be slightly slower
> > then with the revert alone:
> >
> >       # after-revert-6.6.93
> >       Arithmetic Mean: 843.64 Mbits/sec
> >       Geometric Mean: 841.95 Mbits/sec
> >
> >       # after-tcp-fix-6.6.93
> >       Arithmetic Mean: 823.00 Mbits/sec
> >       Geometric Mean: 819.38 Mbits/sec
> >
>
> Re-sending this question in case this message got lost:
>
> > Do you think that this is an actual performance regression, or just a
> > sample set that is not big enough to work out the averages?
> >
> > Here is the data collected for each of the 50 tests:
> >       - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> >       - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz

Hi Eric,

Many thanks for this great data!

I have been looking at this data. It's quite interesting.

Looking at the CDF of throughputs for the "revert" cases vs the "fix"
cases (attached), it does look like, for the 70th percentile and below
(the 70% most unlucky cases), the "fix" cases have lower throughput,
and IMHO this looks outside the realm of what we would expect from
noise.
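
(For anyone reproducing this kind of comparison: a rough percentile
read-out can be pulled straight from the raw numbers, as in this sketch;
the filenames are illustrative, with one Mbits/sec value per line per
file.)

	for f in after-revert.txt after-fix.txt; do
		sort -n "$f" | awk -v name="$f" '
			{ v[NR] = $1 }
			END { printf "%s:", name
			      for (p = 10; p <= 90; p += 20) {
			          i = int(p / 100 * NR); if (i < 1) i = 1
			          printf " p%d=%.0f", p, v[i]
			      }
			      print "" }'
	done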

However, when I look at the traces, I don't see any reason why the
"fix" cases would be systematically slower. In particular, the "fix"
and "revert" cases are only changing a function used for "undo"
decisions, but for both the "fix" or "revert" cases, there are no
"undo" events, and I don't see cases with spurious retransmissions
where there should have been "undo" events and yet there were not.

Visually inspecting the traces, the dominant determinant of
performance seems to be how many RTO events there were. For example,
the worst case for the "fix" trials has 16 RTOs, whereas the worst
case for the "revert" trials has 13 RTOs. And the number of RTO events
per trial looks random; I see similar qualitative patterns between
"fix" and "revert" cases, and don't see any reason why there are more
RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
be due to pre-existing (longstanding) performance problems in non-SACK
loss recovery.
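
(A quick way to get per-trial RTO counts from the captured nstat logs,
as a sketch: TcpExtTCPTimeouts is the nstat counter for RTO timer
firings, and since the capture loop records ~50ms deltas, summing that
column over a trial's log approximates that trial's RTO count. The log
path below is illustrative.)

	awk '$1 == "TcpExtTCPTimeouts" { sum += $2 } END { print sum + 0 }' \
		/tmp/after-fix-nstat.txt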

One way to proceed would be for me to offer some performance fixes for
the RTOs, so we can get rid of the RTOs, which are the biggest source
of performance variation. That should greatly reduce noise, and
perhaps make it easier to see if there is any real difference between
"fix" and "revert" cases.

We could compare the following two kernels, with another 50 tests for
each of two kernels:

+ (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
+ (b) 6.6.93 + {2 patches to fix RTOs} + "fix"

where:

"revert" =  revert e37ab7373696 ("tcp: fix to allow timestamp undo if
no retransmits were sent")
"fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
tcp_is_non_sack_preventing_reopen() behavior")
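
(Concretely, each kernel could be prepared roughly like the earlier
bisect/revert steps; a sketch, where the patch filename is illustrative
and the 2 RTO patches are still to be posted:)

	# (a) "revert" kernel:
	git checkout v6.6.93
	git revert e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f  # 6.6.y backport
	# ...then apply the 2 RTO patches, build, boot, run 50 trials

	# (b) "fix" kernel:
	git checkout v6.6.93
	git am tcp-fix-tcp_packet_delayed.patch  # illustrative filename
	# ...then apply the 2 RTO patches, build, boot, run 50 trials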

This would have the side benefit of testing some performance
improvements for non-SACK connections.

Are you up for that? :-)

Best regards,
neal

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-25 20:19                         ` Neal Cardwell
@ 2025-06-25 23:15                           ` Eric Wheeler
  2025-06-26 14:21                             ` Neal Cardwell
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-25 23:15 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 6339 bytes --]

On Wed, 25 Jun 2025, Neal Cardwell wrote:
> On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> >
> > On Wed, 18 Jun 2025, Eric Wheeler wrote:
> > > On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > > > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > > >
> > > > > >
> > > > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > > > easier.
> > > > >
> > > > > Definitely better, but performance is ~15% slower vs reverting, and the
> > > > > retransmit counts are still higher than the other.  In the two sections
> > > > > below you can see the difference between after the fix and after the
> > > > > revert.
> > > > >
> > > >
> > > > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > > > cases multiple times, so we can get a sense of what is signal and what
> > > > is noise? Perhaps 20 or 50 trials for each approach?
> > >
> > > I ran 50 tests after revert and compare that to after the fix using both
> > > average and geometric mean, and it still appears to be slightly slower
> > > then with the revert alone:
> > >
> > >       # after-revert-6.6.93
> > >       Arithmetic Mean: 843.64 Mbits/sec
> > >       Geometric Mean: 841.95 Mbits/sec
> > >
> > >       # after-tcp-fix-6.6.93
> > >       Arithmetic Mean: 823.00 Mbits/sec
> > >       Geometric Mean: 819.38 Mbits/sec
> > >
> >
> > Re-sending this question in case this message got lost:
> >
> > > Do you think that this is an actual performance regression, or just a
> > > sample set that is not big enough to work out the averages?
> > >
> > > Here is the data collected for each of the 50 tests:
> > >       - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> > >       - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
> 
> Hi Eric,
> 
> Many thanks for this great data!
> 
> I have been looking at this data. It's quite interesting.
> 
> Looking at the CDF of throughputs for the "revert" cases vs the "fix"
> cases (attached) it does look like for the 70-th percentile and below
> (the 70% of most unlucky cases), the "fix" cases have a throughput
> that is lower, and IMHO this looks outside the realm of what we would
> expect from noise.
> 
> However, when I look at the traces, I don't see any reason why the
> "fix" cases would be systematically slower. In particular, the "fix"
> and "revert" cases are only changing a function used for "undo"
> decisions, but for both the "fix" or "revert" cases, there are no
> "undo" events, and I don't see cases with spurious retransmissions
> where there should have been "undo" events and yet there were not.
> 
> Visually inspecting the traces, the dominant determinant of
> performance seems to be how many RTO events there were. For example,
> the worst case for the "fix" trials has 16 RTOs, whereas the worst
> case for the "revert" trials has 13 RTOs. And the number of RTO events
> per trial looks random; I see similar qualitative patterns between
> "fix" and "revert" cases, and don't see any reason why there are more
> RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
> be due to pre-existing (longstanding) performance problems in non-SACK
> loss recovery.
> 
> One way to proceed would be for me to offer some performance fixes for
> the RTOs, so we can get rid of the RTOs, which are the biggest source
> of performance variation. That should greatly reduce noise, and
> perhaps make it easier to see if there is any real difference between
> "fix" and "revert" cases.
> 
> We could compare the following two kernels, with another 50 tests for
> each of two kernels:
> 
> + (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> + (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> 
> where:
> 
> "revert" =  revert e37ab7373696 ("tcp: fix to allow timestamp undo if
> no retransmits were sent")
> "fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
> tcp_is_non_sack_preventing_reopen() behavior")
> 
> This would have the side benefit of testing some performance
> improvements for non-SACK connections.
> 
> Are you up for that? :-)


Sure, if you have some patch ideas in mind, I'm all for getting patches 
merged to improve performance.

BTW, what causes a non-SACK connection?  The RX side is a near-idle Linux 
6.8 host with default sysctl settings.


--
Eric Wheeler


> 
> Best regards,
> neal
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-25 23:15                           ` Eric Wheeler
@ 2025-06-26 14:21                             ` Neal Cardwell
  2025-06-26 20:16                               ` Eric Wheeler
  0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-26 14:21 UTC (permalink / raw)
  To: Eric Wheeler
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

On Wed, Jun 25, 2025 at 7:15 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> On Wed, 25 Jun 2025, Neal Cardwell wrote:
> > On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > >
> > > On Wed, 18 Jun 2025, Eric Wheeler wrote:
> > > > On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > > > > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > > > >
> > > > > > >
> > > > > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > > > > easier.
> > > > > >
> > > > > > Definitely better, but performance is ~15% slower vs reverting, and the
> > > > > > retransmit counts are still higher than the other.  In the two sections
> > > > > > below you can see the difference between after the fix and after the
> > > > > > revert.
> > > > > >
> > > > >
> > > > > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > > > > cases multiple times, so we can get a sense of what is signal and what
> > > > > is noise? Perhaps 20 or 50 trials for each approach?
> > > >
> > > > I ran 50 tests after revert and compare that to after the fix using both
> > > > average and geometric mean, and it still appears to be slightly slower
> > > > then with the revert alone:
> > > >
> > > >       # after-revert-6.6.93
> > > >       Arithmetic Mean: 843.64 Mbits/sec
> > > >       Geometric Mean: 841.95 Mbits/sec
> > > >
> > > >       # after-tcp-fix-6.6.93
> > > >       Arithmetic Mean: 823.00 Mbits/sec
> > > >       Geometric Mean: 819.38 Mbits/sec
> > > >
> > >
> > > Re-sending this question in case this message got lost:
> > >
> > > > Do you think that this is an actual performance regression, or just a
> > > > sample set that is not big enough to work out the averages?
> > > >
> > > > Here is the data collected for each of the 50 tests:
> > > >       - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> > > >       - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
> >
> > Hi Eric,
> >
> > Many thanks for this great data!
> >
> > I have been looking at this data. It's quite interesting.
> >
> > Looking at the CDF of throughputs for the "revert" cases vs the "fix"
> > cases (attached) it does look like for the 70-th percentile and below
> > (the 70% of most unlucky cases), the "fix" cases have a throughput
> > that is lower, and IMHO this looks outside the realm of what we would
> > expect from noise.
> >
> > However, when I look at the traces, I don't see any reason why the
> > "fix" cases would be systematically slower. In particular, the "fix"
> > and "revert" cases are only changing a function used for "undo"
> > decisions, but for both the "fix" or "revert" cases, there are no
> > "undo" events, and I don't see cases with spurious retransmissions
> > where there should have been "undo" events and yet there were not.
> >
> > Visually inspecting the traces, the dominant determinant of
> > performance seems to be how many RTO events there were. For example,
> > the worst case for the "fix" trials has 16 RTOs, whereas the worst
> > case for the "revert" trials has 13 RTOs. And the number of RTO events
> > per trial looks random; I see similar qualitative patterns between
> > "fix" and "revert" cases, and don't see any reason why there are more
> > RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
> > be due to pre-existing (longstanding) performance problems in non-SACK
> > loss recovery.
> >
> > One way to proceed would be for me to offer some performance fixes for
> > the RTOs, so we can get rid of the RTOs, which are the biggest source
> > of performance variation. That should greatly reduce noise, and
> > perhaps make it easier to see if there is any real difference between
> > "fix" and "revert" cases.
> >
> > We could compare the following two kernels, with another 50 tests for
> > each of two kernels:
> >
> > + (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> > + (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> >
> > where:
> >
> > "revert" =  revert e37ab7373696 ("tcp: fix to allow timestamp undo if
> > no retransmits were sent")
> > "fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
> > tcp_is_non_sack_preventing_reopen() behavior")
> >
> > This would have the side benefit of testing some performance
> > improvements for non-SACK connections.
> >
> > Are you up for that? :-)
>
>
> Sure, if you have some patch ideas in mind, I'm all for getting patches
> merged to improve performance.

Great! Thanks for being willing to do this! I will try to post some
patches ASAP.

> BTW, what causes a non-SACK connection?  The RX side is a near-idle Linux
> 6.8 host with default sysctl settings.

Given that the RX side is a Linux 6.8 host, the kernel should support
TCP SACK due to kernel compile-time defaults (see the
"net->ipv4.sysctl_tcp_sack = 1;" in net/ipv4/tcp_ipv4.c).

Given that factor, off-hand, I can think of only a few reasons why the
RX side would not negotiate SACK support:

(1) Some script or software on the RX machine has disabled SACK via
"sysctl net.ipv4.tcp_sack=0" or equivalent, perhaps at boot time (this
is easy to check with "sysctl net.ipv4.tcp_sack").

(2) There is a middlebox on the path (doing firewalling or NAT, etc)
that disables SACK

(3) There is a firewall rule on some machine or router/switch that disables SACK

Off-hand, I would think that (2) is the most likely case, since
intentionally disabling SACK via sysctl or firewall rule is
inadvisable and rare.
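
If it helps to narrow things down, here is a rough sketch of the
corresponding checks (assuming the 192.168.1.203 receiver from your
tests; "eth0" is a placeholder interface name):

	# (1): is SACK enabled locally? (run on both hosts)
	sysctl net.ipv4.tcp_sack

	# did a live connection actually negotiate SACK? run on the sender
	# during an iperf3 test; "sack" in the output means it did:
	ss -tin dst 192.168.1.203

	# (2)/(3): capture the handshake on both hosts and compare the TCP
	# options; if the SYN leaves with "sackOK" but arrives without it,
	# a middlebox or firewall on the path stripped it:
	tcpdump -ni eth0 -c 4 'tcp port 5201 and (tcp[tcpflags] & tcp-syn) != 0'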

Any thoughts on which of these might be in play here?

Thanks,
neal

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
  2025-06-26 14:21                             ` Neal Cardwell
@ 2025-06-26 20:16                               ` Eric Wheeler
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Wheeler @ 2025-06-26 20:16 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
	Yuchung Cheng, stable

[-- Attachment #1: Type: text/plain, Size: 9075 bytes --]

On Thu, 26 Jun 2025, Neal Cardwell wrote:
> On Wed, Jun 25, 2025 at 7:15 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> >
> > On Wed, 25 Jun 2025, Neal Cardwell wrote:
> > > On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > >
> > > > On Wed, 18 Jun 2025, Eric Wheeler wrote:
> > > > > On Mon, 16 Jun 2025, Neal Cardwell wrote:
> > > > > > On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > On Sun, 15 Jun 2025, Eric Wheeler wrote:
> > > > > > > > On Tue, 10 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > > > > > > > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > > > > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > > > > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > > > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > > > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > > > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > > > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >         tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > > > > > > > > >                 upstream:       e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > > > > > > > > >                 stable 6.6.y:   e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > > > > >
> > > > > > > >
> > > > > > > > > The attached patch should apply (with "git am") for any recent kernel
> > > > > > > > > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > > > > > > > > sent" patch it is fixing. So you should be able to test it on top of
> > > > > > > > > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > > > > > > > > easier.
> > > > > > >
> > > > > > > Definitely better, but performance is ~15% slower vs reverting, and the
> > > > > > > retransmit counts are still higher than with the revert.  In the two
> > > > > > > sections below you can see the difference between after the fix and
> > > > > > > after the revert.
> > > > > > >
> > > > > >
> > > > > > Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> > > > > > cases multiple times, so we can get a sense of what is signal and what
> > > > > > is noise? Perhaps 20 or 50 trials for each approach?
> > > > >
> > > > > I ran 50 tests after the revert and compared that to after the fix using
> > > > > both the arithmetic mean and geometric mean, and the fix still appears
> > > > > to be slightly slower than with the revert alone:
> > > > >
> > > > >       # after-revert-6.6.93
> > > > >       Arithmetic Mean: 843.64 Mbits/sec
> > > > >       Geometric Mean: 841.95 Mbits/sec
> > > > >
> > > > >       # after-tcp-fix-6.6.93
> > > > >       Arithmetic Mean: 823.00 Mbits/sec
> > > > >       Geometric Mean: 819.38 Mbits/sec
> > > > >
> > > >
> > > > Re-sending this question in case this message got lost:
> > > >
> > > > > Do you think that this is an actual performance regression, or just a
> > > > > sample set that is not big enough to work out the averages?
> > > > >
> > > > > Here is the data collected for each of the 50 tests:
> > > > >       - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> > > > >       - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
> > >
> > > Hi Eric,
> > >
> > > Many thanks for this great data!
> > >
> > > I have been looking at this data. It's quite interesting.
> > >
> > > Looking at the CDF of throughputs for the "revert" cases vs the "fix"
> > > cases (attached), it does look like, for the 70th percentile and below
> > > (the most unlucky 70% of cases), the "fix" cases have lower
> > > throughput, and IMHO this looks outside the realm of what we would
> > > expect from noise.
> > >
> > > However, when I look at the traces, I don't see any reason why the
> > > "fix" cases would be systematically slower. In particular, the "fix"
> > > and "revert" cases only change a function used for "undo"
> > > decisions, but in both the "fix" and "revert" cases there are no
> > > "undo" events, and I don't see cases with spurious retransmissions
> > > where there should have been "undo" events and yet there were not.
> > >
> > > Visually inspecting the traces, the dominant determinant of
> > > performance seems to be how many RTO events there were. For example,
> > > the worst case for the "fix" trials has 16 RTOs, whereas the worst
> > > case for the "revert" trials has 13 RTOs. And the number of RTO events
> > > per trial looks random; I see similar qualitative patterns between
> > > "fix" and "revert" cases, and don't see any reason why there are more
> > > RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
> > > be due to pre-existing (longstanding) performance problems in non-SACK
> > > loss recovery.
> > >
> > > One way to proceed would be for me to offer some performance fixes for
> > > the RTOs, so we can get rid of the RTOs, which are the biggest source
> > > of performance variation. That should greatly reduce noise, and
> > > perhaps make it easier to see if there is any real difference between
> > > "fix" and "revert" cases.
> > >
> > > We could compare the following two kernels, with another 50 tests for
> > > each:
> > >
> > > + (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> > > + (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
> > >
> > > where:
> > >
> > > "revert" =  revert e37ab7373696 ("tcp: fix to allow timestamp undo if
> > > no retransmits were sent")
> > > "fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
> > > tcp_is_non_sack_preventing_reopen() behavior"
> > >
> > > This would have the side benefit of testing some performance
> > > improvements for non-SACK connections.
> > >
> > > Are you up for that? :-)
> >
> >
> > Sure, if you have some patch ideas in mind, I'm all for getting patches
> > merged to improve performance.
> 
> Great! Thanks for being willing to do this! I will try to post some
> patches ASAP.
> 
> > BTW, what causes a non-SACK connection?  The RX side is a near-idle Linux
> > 6.8 host with default sysctl settings.
> 
> Given the RX side is a Linux 6.8 host, the kernel should support TCP
> SACK due to kernel compile-time defaults (see the
> "net->ipv4.sysctl_tcp_sack = 1;" line in net/ipv4/tcp_ipv4.c).
>
> Given that, off-hand, I can think of only a few reasons why the
> RX side would not negotiate SACK support:
> 
> (1) Some script or software on the RX machine has disabled SACK via
> "sysctl net.ipv4.tcp_sack=0" or equivalent, perhaps at boot time (this
> is easy to check with "sysctl net.ipv4.tcp_sack").


It looks like you are right:

	# cat /proc/sys/net/ipv4/tcp_sack 
	0

and it runs way faster after turning it on:

	~]# iperf3 -c 192.168.1.203
	Connecting to host 192.168.1.203, port 5201
	[  5] local 192.168.1.52 port 55104 connected to 192.168.1.203 port 5201
	[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
	[  5]   0.00-1.00   sec   115 MBytes   964 Mbits/sec   27    234 KBytes       
	[  5]   1.00-2.00   sec   113 MBytes   949 Mbits/sec    7    242 KBytes       
	[  5]   2.00-3.00   sec   113 MBytes   950 Mbits/sec    7    247 KBytes       
	[  5]   3.00-4.00   sec   113 MBytes   947 Mbits/sec    8    261 KBytes       
	[  5]   4.00-5.00   sec   114 MBytes   953 Mbits/sec   11    261 KBytes       
	[  5]   5.00-6.00   sec   113 MBytes   948 Mbits/sec    9    261 KBytes       
	[  5]   6.00-7.00   sec   113 MBytes   950 Mbits/sec    5    261 KBytes       
	[  5]   7.00-8.00   sec   113 MBytes   947 Mbits/sec   10    272 KBytes       
	[  5]   8.00-9.00   sec   113 MBytes   950 Mbits/sec    5    274 KBytes       
	[  5]   9.00-10.00  sec   113 MBytes   947 Mbits/sec    6    275 KBytes       
	- - - - - - - - - - - - - - - - - - - - - - - - -
	[ ID] Interval           Transfer     Bitrate         Retr
	[  5]   0.00-10.00  sec  1.11 GBytes   950 Mbits/sec   95             sender
	[  5]   0.00-10.04  sec  1.11 GBytes   945 Mbits/sec                  receiver
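
In case anyone else trips over this: something like the following can
find whatever disabled it at boot and make the re-enable persistent
(the paths and the .conf filename below are distro-dependent guesses):

	# re-enable immediately
	sysctl -w net.ipv4.tcp_sack=1

	# find what turned it off at boot
	grep -rn tcp_sack /etc/sysctl.conf /etc/sysctl.d /usr/lib/sysctl.d 2>/dev/null

	# persist across reboots (hypothetical filename)
	echo 'net.ipv4.tcp_sack = 1' > /etc/sysctl.d/99-tcp-sack.conf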

Do you want to continue troubleshooting non-SACK performance, since I
have a reliable way to reproduce the issue, or leave it here with "I
should have had SACK enabled"?

-Eric

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-06-26 20:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-06  1:32 [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent Eric Wheeler
2025-06-06 17:16 ` Neal Cardwell
2025-06-06 22:34   ` Eric Wheeler
2025-06-07 19:13     ` Neal Cardwell
2025-06-07 22:54       ` Neal Cardwell
2025-06-07 23:26         ` Neal Cardwell
2025-06-09 17:45           ` Neal Cardwell
2025-06-10 17:15             ` Neal Cardwell
2025-06-12 18:23               ` Neal Cardwell
2025-06-13 21:02                 ` Neal Cardwell
2025-06-15 20:00               ` Eric Wheeler
2025-06-16 20:13                 ` Eric Wheeler
2025-06-16 21:07                   ` Neal Cardwell
2025-06-18 22:03                     ` Eric Wheeler
2025-06-25 19:17                       ` Eric Wheeler
2025-06-25 20:19                         ` Neal Cardwell
2025-06-25 23:15                           ` Eric Wheeler
2025-06-26 14:21                             ` Neal Cardwell
2025-06-26 20:16                               ` Eric Wheeler

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).