* [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Eric Wheeler @ 2025-06-06 1:32 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Neal Cardwell,
Sasha Levin, Yuchung Cheng, stable
Hello Neal,
After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
10GbE via one SFP+ DAC (no bonding), we found TCP performance with
existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
across the switch on 10Gbit ports runs at full 10GbE.
Interestingly, the problem only presents itself when transmitting
from Linux; receive traffic (to Linux) performs just fine:
~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
Through bisection, we found this first-bad commit:
tcp: fix to allow timestamp undo if no retransmits were sent
upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
To validate the regression, we performed the procedures below using the
latest versions of Linux. As you can see by comparing the performance
measurements, throughput is 10-16x higher after reverting. This appears to affect
everything after ~v6.12-rc1, when the patch was introduced, as well as any
stable releases that cherry-picked it. I have pasted the small commit that
was reverted below for your reference.
Do you understand why it would behave this way, and what the correct fix
(or possible workaround) would be?
We can currently reproduce this reliably; please let me know if you
would like us to gather any additional information.
-Eric
# Testing v6.6.92
## Before Revert
- git checkout v6.6.92
- build, boot, test with `iperf3 -c <ip>`
[ 5] local 192.168.1.52 port 42886 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 41.5 MBytes 348 Mbits/sec 153 322 KBytes
[ 5] 1.00-2.00 sec 3.68 MBytes 30.8 Mbits/sec 491 368 KBytes
[ 5] 2.00-3.00 sec 3.00 MBytes 25.2 Mbits/sec 1477 425 KBytes
[ 5] 3.00-4.00 sec 3.25 MBytes 27.2 Mbits/sec 1348 2.85 KBytes
[ 5] 4.00-5.00 sec 3.43 MBytes 28.8 Mbits/sec 1875 498 KBytes
[ 5] 5.00-6.00 sec 3.49 MBytes 29.3 Mbits/sec 1957 471 KBytes
[ 5] 6.00-7.00 sec 2.48 MBytes 20.8 Mbits/sec 1463 538 KBytes
[ 5] 7.00-8.00 sec 1.25 MBytes 10.5 Mbits/sec 1072 603 KBytes
[ 5] 8.00-9.00 sec 3.71 MBytes 31.2 Mbits/sec 1362 593 KBytes
[ 5] 9.00-10.00 sec 2.50 MBytes 21.0 Mbits/sec 1676 624 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 68.3 MBytes 57.3 Mbits/sec 12874 sender <<<
[ 5] 0.00-10.04 sec 64.9 MBytes 54.3 Mbits/sec receiver <<<
## After Revert
- git checkout v6.6.92
- git revert e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
- build, boot, test with `iperf3 -c <ip>`
[ 5] local 192.168.1.52 port 44136 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 90.5 MBytes 759 Mbits/sec 117 261 KBytes
[ 5] 1.00-2.00 sec 113 MBytes 949 Mbits/sec 5 264 KBytes
[ 5] 2.00-3.00 sec 113 MBytes 952 Mbits/sec 3 274 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 947 Mbits/sec 5 267 KBytes
[ 5] 4.00-5.00 sec 113 MBytes 949 Mbits/sec 3 248 KBytes
[ 5] 5.00-6.00 sec 113 MBytes 951 Mbits/sec 8 247 KBytes
[ 5] 6.00-7.00 sec 113 MBytes 947 Mbits/sec 5 252 KBytes
[ 5] 7.00-8.00 sec 113 MBytes 950 Mbits/sec 6 247 KBytes
[ 5] 8.00-9.00 sec 113 MBytes 951 Mbits/sec 8 254 KBytes
[ 5] 9.00-10.00 sec 113 MBytes 948 Mbits/sec 3 247 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.08 GBytes 930 Mbits/sec 163 sender <<<
[ 5] 0.00-10.04 sec 1.08 GBytes 925 Mbits/sec receiver <<<
# Testing v6.15.1
## Before Revert
- git checkout v6.15.1
- build, boot, test with `iperf3 -c <ip>`
[ 5] local 192.168.1.52 port 52154 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 77.8 MBytes 652 Mbits/sec 73 298 KBytes
[ 5] 1.00-2.00 sec 3.61 MBytes 30.3 Mbits/sec 530 388 KBytes
[ 5] 2.00-3.00 sec 2.88 MBytes 24.2 Mbits/sec 1126 389 KBytes
[ 5] 3.00-4.00 sec 3.06 MBytes 25.7 Mbits/sec 1750 456 KBytes
[ 5] 4.00-5.00 sec 3.25 MBytes 27.2 Mbits/sec 1822 488 KBytes
[ 5] 5.00-6.00 sec 3.43 MBytes 28.8 Mbits/sec 1506 530 KBytes
[ 5] 6.00-7.00 sec 3.68 MBytes 30.8 Mbits/sec 1926 543 KBytes
[ 5] 7.00-8.00 sec 2.48 MBytes 20.8 Mbits/sec 1675 609 KBytes
[ 5] 8.00-9.00 sec 2.49 MBytes 20.9 Mbits/sec 941 332 KBytes
[ 5] 9.00-10.00 sec 11.1 MBytes 93.4 Mbits/sec 747 358 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 114 MBytes 95.5 Mbits/sec 12096 sender <<<
[ 5] 0.00-10.04 sec 110 MBytes 92.1 Mbits/sec receiver <<<
## After Revert
- git checkout v6.15.1
- git revert e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
- build, boot, test with `iperf3 -c <ip>`
[ 5] local 192.168.1.52 port 52266 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 91.3 MBytes 766 Mbits/sec 81 275 KBytes
[ 5] 1.00-2.00 sec 113 MBytes 947 Mbits/sec 6 281 KBytes
[ 5] 2.00-3.00 sec 113 MBytes 952 Mbits/sec 8 272 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 950 Mbits/sec 3 274 KBytes
[ 5] 4.00-5.00 sec 113 MBytes 950 Mbits/sec 6 264 KBytes
[ 5] 5.00-6.00 sec 113 MBytes 944 Mbits/sec 6 272 KBytes
[ 5] 6.00-7.00 sec 114 MBytes 952 Mbits/sec 9 272 KBytes
[ 5] 7.00-8.00 sec 89.0 MBytes 746 Mbits/sec 62 315 KBytes
[ 5] 8.00-9.00 sec 113 MBytes 947 Mbits/sec 6 304 KBytes
[ 5] 9.00-10.00 sec 113 MBytes 948 Mbits/sec 6 302 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.06 GBytes 910 Mbits/sec 193 sender <<<
[ 5] 0.00-10.04 sec 1.06 GBytes 905 Mbits/sec receiver <<<
# git show e37ab7373696e650d3b6262a5b882aadad69bb9e|cat
commit e37ab7373696e650d3b6262a5b882aadad69bb9e
Author: Neal Cardwell <ncardwell@google.com>
Date: Tue Oct 1 20:05:15 2024 +0000
tcp: fix to allow timestamp undo if no retransmits were sent
Fix the TCP loss recovery undo logic in tcp_packet_delayed() so that
it can trigger undo even if TSQ prevents a fast recovery episode from
reaching tcp_retransmit_skb().
Geumhwan Yu <geumhwan.yu@samsung.com> recently reported that after
this commit from 2019:
commit bc9f38c8328e ("tcp: avoid unconditional congestion window undo
on SYN retransmit")
...and before this fix we could have buggy scenarios like the
following:
+ Due to reordering, a TCP connection receives some SACKs and enters a
spurious fast recovery.
+ TSQ prevents all invocations of tcp_retransmit_skb(), because many
skbs are queued in lower layers of the sending machine's network
stack; thus tp->retrans_stamp remains 0.
+ The connection receives a TCP timestamp ECR value echoing a
timestamp before the fast recovery, indicating that the fast
recovery was spurious.
+ The connection fails to undo the spurious fast recovery because
tp->retrans_stamp is 0, and thus tcp_packet_delayed() returns false,
due to the new logic in the 2019 commit: commit bc9f38c8328e ("tcp:
avoid unconditional congestion window undo on SYN retransmit")
This fix tweaks the logic to be more similar to the
tcp_packet_delayed() logic before bc9f38c8328e, except that we take
care not to be fooled by the FLAG_SYN_ACKED code path zeroing out
tp->retrans_stamp (the bug noted and fixed by Yuchung in
bc9f38c8328e).
Note that this returns the high-level behavior of tcp_packet_delayed()
to again match the comment for the function, which says: "Nothing was
retransmitted or returned timestamp is less than timestamp of the
first retransmission." Note that this comment is in the original
2005-04-16 Linux git commit, so this is evidently long-standing
behavior.
Fixes: bc9f38c8328e ("tcp: avoid unconditional congestion window undo on SYN retransmit")
Reported-by: Geumhwan Yu <geumhwan.yu@samsung.com>
Diagnosed-by: Geumhwan Yu <geumhwan.yu@samsung.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20241001200517.2756803-2-ncardwell.sw@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cc05ec1faac8..233b77890795 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2473,8 +2473,22 @@ static bool tcp_skb_spurious_retrans(const struct tcp_sock *tp,
  */
 static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
-	return tp->retrans_stamp &&
-	       tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+	const struct sock *sk = (const struct sock *)tp;
+
+	if (tp->retrans_stamp &&
+	    tcp_tsopt_ecr_before(tp, tp->retrans_stamp))
+		return true;	/* got echoed TS before first retransmission */
+
+	/* Check if nothing was retransmitted (retrans_stamp==0), which may
+	 * happen in fast recovery due to TSQ. But we ignore zero retrans_stamp
+	 * in TCP_SYN_SENT, since when we set FLAG_SYN_ACKED we also clear
+	 * retrans_stamp even if we had retransmitted the SYN.
+	 */
+	if (!tp->retrans_stamp &&	  /* no record of a retransmit/SYN? */
+	    sk->sk_state != TCP_SYN_SENT) /* not the FLAG_SYN_ACKED case? */
+		return true;	/* nothing was retransmitted */
+
+	return false;
 }
 
 /* Undo procedures. */
--
Eric Wheeler
www.linuxglobal.com
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Neal Cardwell @ 2025-06-06 17:16 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> Hello Neal,
>
> After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> across the switch on 10Gbit ports runs at full 10GbE.
>
> Interestingly, the problem only presents itself when transmitting
> from Linux; receive traffic (to Linux) performs just fine:
> ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
>
> Through bisection, we found this first-bad commit:
>
> tcp: fix to allow timestamp undo if no retransmits were sent
> upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
>
> To validate the regression, we performed the procedures below using the
> latest versions of Linux. As you can see by comparing the performance
> measurements, throughput is 10-16x higher after reverting. This appears to affect
> everything after ~v6.12-rc1, when the patch was introduced, as well as any
> stable releases that cherry-picked it. I have pasted the small commit that
> was reverted below for your reference.
>
> Do you understand why it would behave this way, and what the correct fix
> (or possible workaround) would be?
>
> We can currently reproduce this reliably; please let me know if you
> would like us to gather any additional information.
Hi Eric,
Thank you for your detailed report and your offer to run some more tests!
I don't have any good theories yet. It is striking that the apparent
retransmit rate is more than 100x higher in your "Before Revert" case
than in your "After Revert" case. It seems like something very odd is
going on. :-)
If you could re-run the tests while gathering more information, and
share that information, that would be very useful. Specifically, we
would need the following, for both the (a) Before Revert and (b) After
Revert kernels:
# as root, before the test starts, start instrumentation
# and leave it running in the background; something like:
(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
# then run the test
# then kill the instrumentation loops running in the background:
kill %1 %2 %3
Then if you could copy the iperf output and these output files to a
web server, or Dropbox, or Google Drive, etc, and share the URL, I
would be very grateful.
For this next phase, there's no need to test both 6.6 and 6.15.
Testing either one is fine. We just need, say, 6.15 before the revert,
and 6.15 after the revert.
Thanks!
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Eric Wheeler @ 2025-06-06 22:34 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Fri, 6 Jun 2025, Neal Cardwell wrote:
> On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> >
> > Hello Neal,
> >
> > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > across the switch on 10Gbit ports runs at full 10GbE.
> >
> > Interestingly, the problem only presents itself when transmitting
> > from Linux; receive traffic (to Linux) performs just fine:
> > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> >
> > Through bisection, we found this first-bad commit:
> >
> > tcp: fix to allow timestamp undo if no retransmits were sent
> > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> >
>
> Thank you for your detailed report and your offer to run some more tests!
>
> I don't have any good theories yet. It is striking that the apparent
> retransmit rate is more than 100x higher in your "Before Revert" case
> than in your "After Revert" case. It seems like something very odd is
> going on. :-)
Good point; I wonder what that might imply...
> If you could re-run tests while gathering more information, and share
> that information, that would be very useful.
>
> What would be very useful would be the following information, for both
> (a) Before Revert, and (b) After Revert kernels:
>
> # as root, before the test starts, start instrumentation
> # and leave it running in the background; something like:
> (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
>
> # then run the test
>
> # then kill the instrumentation loops running in the background:
> kill %1 %2 %3
Sure, here they are:
https://www.linuxglobal.com/out/for-neal/
These are the commands we ran. You will probably notice that the
capture runs on a bridge (br0), but the behavior is the same whether or
not the NIC is enslaved to a bridge. (The way these systems are
configured, it is quite difficult to change the bridge setup, so I
would like to keep it as it is for testing, if possible.)
# Before Revert
WHEN=before-revert
(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/$WHEN-nstat.txt &
tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
iperf3 -c 192.168.1.203
kill %1 %2 %3
[1] 1769507
[2] 1769511
[3] 1769512
Connecting to host 192.168.1.203, port 5201
dropped privs to tcpdump
tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
[ 5] local 192.168.1.51 port 44674 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 110 MBytes 920 Mbits/sec 28 278 KBytes
[ 5] 1.00-2.00 sec 3.19 MBytes 26.7 Mbits/sec 260 336 KBytes
[ 5] 2.00-3.00 sec 3.06 MBytes 25.7 Mbits/sec 862 431 KBytes
[ 5] 3.00-4.00 sec 3.12 MBytes 26.2 Mbits/sec 1730 462 KBytes
[ 5] 4.00-5.00 sec 3.25 MBytes 27.2 Mbits/sec 1490 443 KBytes
[ 5] 5.00-6.00 sec 3.31 MBytes 27.8 Mbits/sec 1898 543 KBytes
[ 5] 6.00-7.00 sec 2.45 MBytes 20.5 Mbits/sec 1640 111 KBytes
[ 5] 7.00-8.00 sec 3.70 MBytes 31.0 Mbits/sec 1868 530 KBytes
[ 5] 8.00-9.00 sec 3.71 MBytes 31.1 Mbits/sec 2137 539 KBytes
[ 5] 9.00-10.00 sec 3.75 MBytes 31.5 Mbits/sec 1012 365 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 139 MBytes 117 Mbits/sec 12925 sender
[ 5] 0.00-10.04 sec 137 MBytes 114 Mbits/sec receiver
iperf Done.
35180 packets captured
35607 packets received by filter
0 packets dropped by kernel
[root@hv ~]# renice -20 $$
1760056 (process ID) old priority 0, new priority -20
[1] Terminated ( while true; do
date +%s.%N; ss -tenmoi; sleep 0.050;
done ) > /tmp/$WHEN-ss.txt
[2]- Terminated ( while true; do
date +%s.%N; nstat; sleep 0.050;
done ) > /tmp/$WHEN-nstat.txt
[3]+ Done tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203
[root@hv ~]# tar cvzf $WHEN.tar.gz /tmp/$WHEN*
# After Revert
eth=br0
WHEN=after-revert
(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/$WHEN-nstat.txt &
tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
iperf3 -c 192.168.1.203
kill %1 %2 %3
[1] 1471593
[2] 1471597
[3] 1471598
Connecting to host 192.168.1.203, port 5201
dropped privs to tcpdump
tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
[ 5] local 192.168.1.52 port 41240 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 113 MBytes 948 Mbits/sec 12 282 KBytes
[ 5] 1.00-2.00 sec 90.5 MBytes 759 Mbits/sec 73 288 KBytes
[ 5] 2.00-3.00 sec 114 MBytes 952 Mbits/sec 4 287 KBytes
[ 5] 3.00-4.00 sec 89.9 MBytes 754 Mbits/sec 56 298 KBytes
[ 5] 4.00-5.00 sec 113 MBytes 945 Mbits/sec 26 247 KBytes
[ 5] 5.00-6.00 sec 113 MBytes 946 Mbits/sec 4 261 KBytes
[ 5] 6.00-7.00 sec 113 MBytes 947 Mbits/sec 8 267 KBytes
[ 5] 7.00-8.00 sec 89.4 MBytes 750 Mbits/sec 74 318 KBytes
[ 5] 8.00-9.00 sec 89.7 MBytes 752 Mbits/sec 83 269 KBytes
[ 5] 9.00-10.00 sec 90.2 MBytes 757 Mbits/sec 110 315 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1014 MBytes 851 Mbits/sec 450 sender
[ 5] 0.00-10.04 sec 1013 MBytes 846 Mbits/sec receiver
iperf Done.
131027 packets captured
131841 packets received by filter
0 packets dropped by kernel
[root@hv ~]# tar cvzf after-revert /tmp/before*
tar: Removing leading `/' from member names
tar: /tmp/before*: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors
[1] Terminated ( while true; do
date +%s.%N; ss -tenmoi; sleep 0.050;
done ) > /tmp/$WHEN-ss.txt
[2]- Terminated ( while true; do
date +%s.%N; nstat; sleep 0.050;
done ) > /tmp/$WHEN-nstat.txt
[3]+ Done tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203
[root@hv ~]# tar cvzf $WHEN.tar.gz /tmp/$WHEN*
-Eric
> Then if you could copy the iperf output and these output files to a
> web server, or Dropbox, or Google Drive, etc, and share the URL, I
> would be very grateful.
>
> For this next phase, there's no need to test both 6.6 and 6.15.
> Testing either one is fine. We just need, say, 6.15 before the revert,
> and 6.15 after the revert.
>
> Thanks!
> neal
>
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Neal Cardwell @ 2025-06-07 19:13 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > >
> > > Hello Neal,
> > >
> > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > across the switch on 10Gbit ports runs at full 10GbE.
> > >
> > > Interestingly, the problem only presents itself when transmitting
> > > from Linux; receive traffic (to Linux) performs just fine:
> > > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> > >
> > > Through bisection, we found this first-bad commit:
> > >
> > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > >
> >
> > Thank you for your detailed report and your offer to run some more tests!
> >
> > I don't have any good theories yet. It is striking that the apparent
> > retransmit rate is more than 100x higher in your "Before Revert" case
> > than in your "After Revert" case. It seems like something very odd is
> > going on. :-)
>
> good point, I wonder what that might imply...
>
> > If you could re-run tests while gathering more information, and share
> > that information, that would be very useful.
> >
> > What would be very useful would be the following information, for both
> > (a) Before Revert, and (b) After Revert kernels:
> >
> > # as root, before the test starts, start instrumentation
> > # and leave it running in the background; something like:
> > (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> > nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> > tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
> >
> > # then run the test
> >
> > # then kill the instrumentation loops running in the background:
> > kill %1 %2 %3
>
> Sure, here they are:
>
> https://www.linuxglobal.com/out/for-neal/
Hi Eric,
Many thanks for the traces! They clearly show the buggy behavior.
The problem is an interaction between the non-SACK behavior on these
connections (the non-Linux "device" does not support SACK) and the
undo logic: for non-SACK connections,
tcp_is_non_sack_preventing_reopen() holds steady in CA_Recovery or
CA_Loss at the end of a loss recovery episode but clears
tp->retrans_stamp to 0. Upon the next ACK, the logic from "tcp: fix
to allow timestamp undo if no retransmits were sent" sees
tp->retrans_stamp at 0, erroneously concludes that no data was
retransmitted, and erroneously performs an undo of the cwnd
reduction, restoring cwnd immediately to the value it had before loss
recovery. This causes an immediate build-up of queues and another
immediate loss recovery episode; hence the much higher retransmit
rate in the buggy scenario.
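For reference, a minimal sketch of the kernel logic in question,
paraphrased from net/ipv4/tcp_input.c (simplified, not the verbatim
mainline source):

/* Paraphrased sketch: at the end of a recovery episode, a non-SACK
 * (Reno) flow whose cumulative ACK lands exactly on high_seq holds its
 * congestion-avoidance state but zeroes the retransmit timestamp.
 */
static bool tcp_is_non_sack_preventing_reopen(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
		/* Hold old state until something *above* high_seq is
		 * ACKed; RFC 2582 requires this for Reno to prevent
		 * false fast retransmits.
		 */
		if (!tcp_any_retrans_done(sk))
			tp->retrans_stamp = 0; /* erases the retransmit record */
		return true;
	}
	return false;
}

On the very next ACK, tcp_packet_delayed() (with the 2024 patch
applied) reads that zeroed tp->retrans_stamp as "nothing was
retransmitted" and green-lights the undo.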
I will work on a packetdrill reproducer, test a fix, and post a patch
for testing. I think the simplest fix would be to have
tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
(tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
allow tcp_packet_delayed() to return true in that case. That should be
a precise fix for this scenario and does not risk changing behavior
for the much more common case of loss recovery with SACK support.
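In code terms, the guard described above would look roughly like this
(a sketch of the idea only, not the final patch; the fix posted later
in this thread implements the same check with tcp_is_sack() and
before()):

/* Sketch: refuse to treat a zero retrans_stamp as "no retransmits"
 * while a non-SACK flow is holding at high_seq, because
 * tcp_is_non_sack_preventing_reopen() cleared the stamp there.
 */
static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
{
	const struct sock *sk = (const struct sock *)tp;

	if (tp->retrans_stamp &&
	    tcp_tsopt_ecr_before(tp, tp->retrans_stamp))
		return true;	/* echoed TS predates first retransmission */

	if (tp->snd_una == tp->high_seq && tcp_is_reno(tp))
		return false;	/* stamp was cleared, not "never set" */

	if (!tp->retrans_stamp &&	  /* no record of a retransmit/SYN? */
	    sk->sk_state != TCP_SYN_SENT) /* not the FLAG_SYN_ACKED case? */
		return true;	/* nothing was retransmitted */

	return false;
}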
Eric, would you be willing to test a simple bug fix patch for this?
Thanks!
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Neal Cardwell @ 2025-06-07 22:54 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> >
> > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > >
> > > > Hello Neal,
> > > >
> > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > >
> > > > Interestingly, the problem only presents itself when transmitting
> > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > > > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > >
> > > > Through bisection, we found this first-bad commit:
> > > >
> > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > >
> > >
> > > Thank you for your detailed report and your offer to run some more tests!
> > >
> > > I don't have any good theories yet. It is striking that the apparent
> > > retransmit rate is more than 100x higher in your "Before Revert" case
> > > than in your "After Revert" case. It seems like something very odd is
> > > going on. :-)
> >
> > good point, I wonder what that might imply...
> >
> > > If you could re-run tests while gathering more information, and share
> > > that information, that would be very useful.
> > >
> > > What would be very useful would be the following information, for both
> > > (a) Before Revert, and (b) After Revert kernels:
> > >
> > > # as root, before the test starts, start instrumentation
> > > # and leave it running in the background; something like:
> > > (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> > > nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> > > tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
> > >
> > > # then run the test
> > >
> > > # then kill the instrumentation loops running in the background:
> > > kill %1 %2 %3
> >
> > Sure, here they are:
> >
> > https://www.linuxglobal.com/out/for-neal/
>
> Hi Eric,
>
> Many thanks for the traces! These traces clearly show the buggy
> behavior. The problem is an interaction between the non-SACK behavior
> on these connections (due to the non-Linux "device" not supporting
> SACK) and the undo logic. The problem is that, for non-SACK
> connections, tcp_is_non_sack_preventing_reopen() holds steady in
> CA_Recovery or CA_Loss at the end of a loss recovery episode but
> clears tp->retrans_stamp to 0. So that upon the next ACK the "tcp: fix
> to allow timestamp undo if no retransmits were sent" sees the
> tp->retrans_stamp at 0 and erroneously concludes that no data was
> retransmitted, and erroneously performs an undo of the cwnd reduction,
> restoring cwnd immediately to the value it had before loss recovery.
> This causes an immediate build-up of queues and another immediate loss
> recovery episode. Thus the higher retransmit rate in the buggy
> scenario.
>
> I will work on a packetdrill reproducer, test a fix, and post a patch
> for testing. I think the simplest fix would be to have
> tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
> (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
> allow tcp_packet_delayed() to return true in that case. That should be
> a precise fix for this scenario and does not risk changing behavior
> for the much more common case of loss recovery with SACK support.
Indeed, I'm able to reproduce this issue with erroneous undo events on
non-SACK connections at the end of loss recovery with the attached
packetdrill script.
When you run that script on a kernel with the "tcp: fix to allow
timestamp undo if no retransmits were sent" patch, we see:
+ nstat shows an erroneous TcpExtTCPFullUndo event
+ the loss recovery reduces cwnd from the initial 10 to the correct 7
(from CUBIC) but then the erroneous undo event restores the pre-loss
cwnd of 10 and leads to a final cwnd value of 11
I will test a patch with the proposed fix and report back.
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Neal Cardwell @ 2025-06-07 23:26 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > >
> > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > >
> > > > > Hello Neal,
> > > > >
> > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > >
> > > > > Interestingly, the problem only presents itself when transmitting
> > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > > > > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > >
> > > > > Through bisection, we found this first-bad commit:
> > > > >
> > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > >
> > > >
> > > > Thank you for your detailed report and your offer to run some more tests!
> > > >
> > > > I don't have any good theories yet. It is striking that the apparent
> > > > retransmit rate is more than 100x higher in your "Before Revert" case
> > > > than in your "After Revert" case. It seems like something very odd is
> > > > going on. :-)
> > >
> > > good point, I wonder what that might imply...
> > >
> > > > If you could re-run tests while gathering more information, and share
> > > > that information, that would be very useful.
> > > >
> > > > What would be very useful would be the following information, for both
> > > > (a) Before Revert, and (b) After Revert kernels:
> > > >
> > > > # as root, before the test starts, start instrumentation
> > > > # and leave it running in the background; something like:
> > > > (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> > > > nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> > > > tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
> > > >
> > > > # then run the test
> > > >
> > > > # then kill the instrumentation loops running in the background:
> > > > kill %1 %2 %3
> > >
> > > Sure, here they are:
> > >
> > > https://www.linuxglobal.com/out/for-neal/
> >
> > Hi Eric,
> >
> > Many thanks for the traces! These traces clearly show the buggy
> > behavior. The problem is an interaction between the non-SACK behavior
> > on these connections (due to the non-Linux "device" not supporting
> > SACK) and the undo logic. The problem is that, for non-SACK
> > connections, tcp_is_non_sack_preventing_reopen() holds steady in
> > CA_Recovery or CA_Loss at the end of a loss recovery episode but
> > clears tp->retrans_stamp to 0. So that upon the next ACK the "tcp: fix
> > to allow timestamp undo if no retransmits were sent" sees the
> > tp->retrans_stamp at 0 and erroneously concludes that no data was
> > retransmitted, and erroneously performs an undo of the cwnd reduction,
> > restoring cwnd immediately to the value it had before loss recovery.
> > This causes an immediate build-up of queues and another immediate loss
> > recovery episode. Thus the higher retransmit rate in the buggy
> > scenario.
> >
> > I will work on a packetdrill reproducer, test a fix, and post a patch
> > for testing. I think the simplest fix would be to have
> > tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
> > (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
> > allow tcp_packet_delayed() to return true in that case. That should be
> > a precise fix for this scenario and does not risk changing behavior
> > for the much more common case of loss recovery with SACK support.
>
> Indeed, I'm able to reproduce this issue with erroneous undo events on
> non-SACK connections at the end of loss recovery with the attached
> packetdrill script.
>
> When you run that script on a kernel with the "tcp: fix to allow
> timestamp undo if no retransmits were sent" patch, we see:
>
> + nstat shows an erroneous TcpExtTCPFullUndo event
>
> + the loss recovery reduces cwnd from the initial 10 to the correct 7
> (from CUBIC) but then the erroneous undo event restores the pre-loss
> cwnd of 10 and leads to a final cwnd value of 11
>
> I will test a patch with the proposed fix and report back.
Oops, forgot to attach the packetdrill script! Let's try again...
neal
[-- Attachment #2: fr-non-sack-hold-at-high-seq.pkt --]
[-- Type: application/octet-stream, Size: 2362 bytes --]
// Test that in non-SACK fast recovery, we stay in CA_Recovery when
// snd_una == high_seq, and correctly leave when snd_una > high_seq.
// And that this does not cause a spurious undo.
// Set up config.
`../common/defaults.sh`
// Establish a connection.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 `nstat -n`
+0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
+.020 < . 1:1(0) ack 1 win 320
+0 accept(3, ..., ...) = 4
// Write some data, and send the initial congestion window.
+0 write(4, ..., 15000) = 15000
+0 > P. 1:10001(10000) ack 1
// Limited transmit: on first dupack (for pkt 2), send a new data segment.
+.020 < . 1:1(0) ack 1 win 320
+0 > . 10001:11001(1000) ack 1
+0 %{ print(tcpi_snd_cwnd) }%
+0 %{ print(tcpi_sacked) }%
// Limited transmit: on second dupack (for pkt 3), send a new data segment.
+.002 < . 1:1(0) ack 1 win 320
+0 > . 11001:12001(1000) ack 1
+0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
+0 %{ print(tcpi_sacked) }%
// On third dupack (for pkt 4), enter fast recovery.
+0 < . 1:1(0) ack 1 win 320
+0 > . 1:1001(1000) ack 1
+0 %{ assert tcpi_ca_state == TCP_CA_Recovery, tcpi_ca_state }%
// Receive dupack for pkt 5:
+.002 < . 1:1(0) ack 1 win 320
// Receive dupack for pkt 6:
+.002 < . 1:1(0) ack 1 win 320
// Receive dupack for pkt 7:
+.002 < . 1:1(0) ack 1 win 320
// Receive dupack for pkt 8:
+.002 < . 1:1(0) ack 1 win 320
// Receive dupack for pkt 9:
+.002 < . 1:1(0) ack 1 win 320
// Receive dupack for pkt 10:
+.002 < . 1:1(0) ack 1 win 320
// Receive dupack for limited transmit of pkt 11:
+.002 < . 1:1(0) ack 1 win 320
// Receive dupack for limited transmit of pkt 12:
+.002 < . 1:1(0) ack 1 win 320
// Receive cumulative ACK for fast retransmit that plugged the sequence hole.
// Because this is a non-SACK connection and snd_una == high_seq,
// we stay in CA_Recovery.
+.020 < . 1:1(0) ack 12001 win 320
+0 %{ assert tcpi_ca_state == TCP_CA_Recovery, tcpi_ca_state }%
+0 %{ print(tcpi_snd_cwnd) }%
// Receive ACK advancing snd_una past high_seq
+.002 < . 1:1(0) ack 13001 win 320
+0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
+0 %{ print(tcpi_snd_cwnd) }%
+0 `nstat`
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Neal Cardwell @ 2025-06-09 17:45 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > >
> > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > >
> > > > > > Hello Neal,
> > > > > >
> > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > >
> > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > > > > > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > >
> > > > > > Through bisection, we found this first-bad commit:
> > > > > >
> > > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
> > > > > >
> > > > >
> > > > > Thank you for your detailed report and your offer to run some more tests!
> > > > >
> > > > > I don't have any good theories yet. It is striking that the apparent
> > > > > retransmit rate is more than 100x higher in your "Before Revert" case
> > > > > than in your "After Revert" case. It seems like something very odd is
> > > > > going on. :-)
> > > >
> > > > good point, I wonder what that might imply...
> > > >
> > > > > If you could re-run tests while gathering more information, and share
> > > > > that information, that would be very useful.
> > > > >
> > > > > What would be very useful would be the following information, for both
> > > > > (a) Before Revert, and (b) After Revert kernels:
> > > > >
> > > > > # as root, before the test starts, start instrumentation
> > > > > # and leave it running in the background; something like:
> > > > > (while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/ss.txt &
> > > > > nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/nstat.txt &
> > > > > tcpdump -w /tmp/tcpdump.${eth}.pcap -n -s 116 -c 1000000 &
> > > > >
> > > > > # then run the test
> > > > >
> > > > > # then kill the instrumentation loops running in the background:
> > > > > kill %1 %2 %3
> > > >
> > > > Sure, here they are:
> > > >
> > > > https://www.linuxglobal.com/out/for-neal/
> > >
> > > Hi Eric,
> > >
> > > Many thanks for the traces! These traces clearly show the buggy
> > > behavior. The problem is an interaction between the non-SACK behavior
> > > on these connections (due to the non-Linux "device" not supporting
> > > SACK) and the undo logic. The problem is that, for non-SACK
> > > connections, tcp_is_non_sack_preventing_reopen() holds steady in
> > > CA_Recovery or CA_Loss at the end of a loss recovery episode but
> > > clears tp->retrans_stamp to 0. So that upon the next ACK the "tcp: fix
> > > to allow timestamp undo if no retransmits were sent" sees the
> > > tp->retrans_stamp at 0 and erroneously concludes that no data was
> > > retransmitted, and erroneously performs an undo of the cwnd reduction,
> > > restoring cwnd immediately to the value it had before loss recovery.
> > > This causes an immediate build-up of queues and another immediate loss
> > > recovery episode. Thus the higher retransmit rate in the buggy
> > > scenario.
> > >
> > > I will work on a packetdrill reproducer, test a fix, and post a patch
> > > for testing. I think the simplest fix would be to have
> > > tcp_packet_delayed(), when tp->retrans_stamp is zero, check for the
> > > (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) condition and not
> > > allow tcp_packet_delayed() to return true in that case. That should be
> > > a precise fix for this scenario and does not risk changing behavior
> > > for the much more common case of loss recovery with SACK support.
> >
> > Indeed, I'm able to reproduce this issue with erroneous undo events on
> > non-SACK connections at the end of loss recovery with the attached
> > packetdrill script.
> >
> > When you run that script on a kernel with the "tcp: fix to allow
> > timestamp undo if no retransmits were sent" patch, we see:
> >
> > + nstat shows an erroneous TcpExtTCPFullUndo event
> >
> > + the loss recovery reduces cwnd from the initial 10 to the correct 7
> > (from CUBIC) but then the erroneous undo event restores the pre-loss
> > cwnd of 10 and leads to a final cwnd value of 11
> >
> > I will test a patch with the proposed fix and report back.
And the attached packetdrill script (which "passes" on a kernel with
"tcp: fix to allow timestamp undo if no retransmits were sent") shows
that a similar erroneous undo (TcpExtTCPLossUndo in this case) happens
at the end of RTO recovery (CA_Loss) if an ACK makes snd_una exactly
equal to high_seq. This is expected, given that both fast recovery and
RTO recovery use tcp_is_non_sack_preventing_reopen().
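For context, both recovery types gate their undo decision on the same
predicate, whose shape (paraphrased from net/ipv4/tcp_input.c) is
roughly:

/* Paraphrased: tcp_try_undo_recovery() (fast recovery) and
 * tcp_try_undo_loss() (RTO/CA_Loss) both consult this, so a bogus
 * tcp_packet_delayed() verdict poisons both recovery types.
 */
static bool tcp_may_undo(const struct tcp_sock *tp)
{
	return tp->undo_marker &&
	       (!tp->undo_retrans || tcp_packet_delayed(tp));
}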
neal
[-- Attachment #2: frto-real-timeout-nonsack-hold-at-high-seq.pkt --]
[-- Type: application/octet-stream, Size: 2028 bytes --]
// Test F-RTO on a real timeout without SACK.
// Identical to frto-real-timeout-nonsack.pkt
// except that there is an ACK that advances snd_una
// to exactly equal high_seq, to test this tricky case.
// Set up config.
`../common/defaults.sh`
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 `nstat -n`
+0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
+.02 < . 1:1(0) ack 1 win 257
+0 accept(3, ..., ...) = 4
+0 write(4, ..., 15000) = 15000
+0 > P. 1:10001(10000) ack 1
// RTO and retransmit head
+.22 > . 1:1001(1000) ack 1
// F-RTO probes
+.01 < . 1:1(0) ack 1001 win 257
+0 > P. 10001:12001(2000) ack 1
// The probes are acked so the timeout is real.
+.05 < . 1:1(0) ack 1001 win 257
+0 > . 1001:2001(1000) ack 1
+0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
+0 %{ assert tcpi_snd_cwnd == 2, tcpi_snd_cwnd }%
+0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
+.002 < . 1:1(0) ack 1001 win 257
+0 > . 2001:3001(1000) ack 1
+.05 < . 1:1(0) ack 3001 win 257
+0 > . 3001:4001(1000) ack 1
+.05 < . 1:1(0) ack 4001 win 257
+0 > P. 4001:6001(2000) ack 1
+.05 < . 1:1(0) ack 6001 win 257
+0 > P. 6001:10001(4000) ack 1
// Receive ack and advance snd_una to match high_seq.
+.05 < . 1:1(0) ack 10001 win 257
+0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
+0 %{ assert tcpi_snd_cwnd == 7, tcpi_snd_cwnd }%
+0 %{ assert tcpi_snd_ssthresh == 7, tcpi_snd_ssthresh }%
// Receive ack and advance snd_una beyond high_seq.
+0 < . 1:1(0) ack 12001 win 257
+0 > P. 12001:15001(3000) ack 1
+0 %{ assert tcpi_ca_state == TCP_CA_Loss, tcpi_ca_state }%
+0 %{ assert tcpi_snd_cwnd == 12, tcpi_snd_cwnd }%
+.05 < . 1:1(0) ack 15001 win 257
+0 %{ assert tcpi_ca_state == TCP_CA_Open, tcpi_ca_state }%
+0 %{ assert tcpi_snd_cwnd == 15, tcpi_snd_cwnd }%
+0 `nstat`
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Neal Cardwell @ 2025-06-10 17:15 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > >
> > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > >
> > > > > > > Hello Neal,
> > > > > > >
> > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > >
> > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > > > > > > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > >
> > > > > > > Through bisection, we found this first-bad commit:
> > > > > > >
> > > > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
Hi Eric,
Do you have cycles to test a proposed fix patch developed by our team?
The attached patch should apply (with "git am") for any recent kernel
that has the "tcp: fix to allow timestamp undo if no retransmits were
sent" patch it is fixing. So you should be able to test it on top of
the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
easier.
If you have cycles to rerun your iperf test, with tcpdump, nstat, and
ss instrumentation, that would be fantastic!
The patch passes our internal packetdrill test suite, including new
tests for this issue (based on the packetdrill scripts posted earlier
in this thread).
But it would be fantastic to directly confirm that this fixes your issue.
Thanks!
neal
[-- Attachment #2: 0001-tcp-fix-tcp_packet_delayed-for-tcp_is_non_sack_preve.patch --]
[-- Type: application/octet-stream, Size: 4511 bytes --]
From 3cc3efcc051b27ced412c7d02e7784da8810751c Mon Sep 17 00:00:00 2001
From: Neal Cardwell <ncardwell@google.com>
Date: Sat, 7 Jun 2025 23:08:26 +0000
Subject: [PATCH] tcp: fix tcp_packet_delayed() for
tcp_is_non_sack_preventing_reopen() behavior
After the following commit from 2024:
commit e37ab7373696 ("tcp: fix to allow timestamp undo if no retransmits were sent")
...there was buggy behavior where TCP connections without SACK support
could easily see erroneous undo events at the end of fast recovery or
RTO recovery episodes. The erroneous undo events could cause those
connections to suffer repeated loss recovery episodes and high
retransmit rates.
The problem was an interaction between the non-SACK behavior on these
connections and the undo logic. The problem is that, for non-SACK
connections at the end of a loss recovery episode, if snd_una ==
high_seq, then tcp_is_non_sack_preventing_reopen() holds steady in
CA_Recovery or CA_Loss, but clears tp->retrans_stamp to 0. Then upon
the next ACK the "tcp: fix to allow timestamp undo if no retransmits
were sent" logic saw the tp->retrans_stamp at 0 and erroneously
concluded that no data was retransmitted, and erroneously performed an
undo of the cwnd reduction, restoring cwnd immediately to the value it
had before loss recovery. This caused an immediate burst of traffic
and build-up of queues and likely another immediate loss recovery
episode.
This commit fixes tcp_packet_delayed() to ignore zero retrans_stamp
values for non-SACK connections when snd_una is at or above high_seq,
because tcp_is_non_sack_preventing_reopen() clears retrans_stamp in
this case, so it's not a valid signal that we can undo.
Note that the commit named in the Fixes footer restored long-present
behavior from roughly 2005-2019, so apparently this bug was present
for a while during that era, and this was simply not caught.
Fixes: e37ab7373696 ("tcp: fix to allow timestamp undo if no retransmits were sent")
Reported-by: Eric Wheeler <netdev@lists.ewheeler.net>
Closes: https://lore.kernel.org/netdev/64ea9333-e7f9-0df-b0f2-8d566143acab@ewheeler.net/
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Co-developed-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
net/ipv4/tcp_input.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a35018e2d0ba2..55e86275c823d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2484,20 +2484,33 @@ static inline bool tcp_packet_delayed(const struct tcp_sock *tp)
 {
 	const struct sock *sk = (const struct sock *)tp;
 
-	if (tp->retrans_stamp &&
-	    tcp_tsopt_ecr_before(tp, tp->retrans_stamp))
-		return true;	/* got echoed TS before first retransmission */
-
-	/* Check if nothing was retransmitted (retrans_stamp==0), which may
-	 * happen in fast recovery due to TSQ. But we ignore zero retrans_stamp
-	 * in TCP_SYN_SENT, since when we set FLAG_SYN_ACKED we also clear
-	 * retrans_stamp even if we had retransmitted the SYN.
+	/* Received an echoed timestamp before the first retransmission? */
+	if (tp->retrans_stamp)
+		return tcp_tsopt_ecr_before(tp, tp->retrans_stamp);
+
+	/* We set tp->retrans_stamp upon the first retransmission of a loss
+	 * recovery episode, so normally if tp->retrans_stamp is 0 then no
+	 * retransmission has happened yet (likely due to TSQ, which can cause
+	 * fast retransmits to be delayed). So if snd_una advanced while
+	 * tp->retrans_stamp is 0 then apparently a packet was merely delayed,
+	 * not lost. But there are exceptions where we retransmit but then
+	 * clear tp->retrans_stamp, so we check for those exceptions.
 	 */
-	if (!tp->retrans_stamp &&	  /* no record of a retransmit/SYN? */
-	    sk->sk_state != TCP_SYN_SENT) /* not the FLAG_SYN_ACKED case? */
-		return true;	/* nothing was retransmitted */
-
-	return false;
+
+	/* (1) For non-SACK connections, tcp_is_non_sack_preventing_reopen()
+	 *     clears tp->retrans_stamp when snd_una == high_seq.
+	 */
+	if (!tcp_is_sack(tp) && !before(tp->snd_una, tp->high_seq))
+		return false;
+
+	/* (2) In TCP_SYN_SENT tcp_clean_rtx_queue() clears tp->retrans_stamp
+	 *     when it sets FLAG_SYN_ACKED, even if the SYN was
+	 *     retransmitted.
+	 */
+	if (sk->sk_state == TCP_SYN_SENT)
+		return false;
+
+	return true;	/* tp->retrans_stamp is zero; no retransmit yet */
 }
 
 /* Undo procedures. */
--
2.50.0.rc0.642.g800a2b2222-goog
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
1 sibling, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-12 18:23 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Tue, Jun 10, 2025 at 1:15 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Jun 9, 2025 at 1:45 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Sat, Jun 7, 2025 at 7:26 PM Neal Cardwell <ncardwell@google.com> wrote:
> > >
> > > On Sat, Jun 7, 2025 at 6:54 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > >
> > > > On Sat, Jun 7, 2025 at 3:13 PM Neal Cardwell <ncardwell@google.com> wrote:
> > > > >
> > > > > On Fri, Jun 6, 2025 at 6:34 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > >
> > > > > > On Fri, 6 Jun 2025, Neal Cardwell wrote:
> > > > > > > On Thu, Jun 5, 2025 at 9:33 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
> > > > > > > >
> > > > > > > > Hello Neal,
> > > > > > > >
> > > > > > > > After upgrading to Linux v6.6.85 on an older Supermicro SYS-2026T-6RFT+
> > > > > > > > with an Intel 82599ES 10GbE NIC (ixgbe) linked to a Netgear GS728TXS at
> > > > > > > > 10GbE via one SFP+ DAC (no bonding), we found TCP performance with
> > > > > > > > existing devices on 1Gbit ports was <60Mbit; however, TCP with devices
> > > > > > > > across the switch on 10Gbit ports runs at full 10GbE.
> > > > > > > >
> > > > > > > > Interestingly, the problem only presents itself when transmitting
> > > > > > > > from Linux; receive traffic (to Linux) performs just fine:
> > > > > > > > ~60Mbit: Linux v6.6.85 =TX=> 10GbE -> switch -> 1GbE -> device
> > > > > > > > ~1Gbit: device =TX=> 1GbE -> switch -> 10GbE -> Linux v6.6.85
> > > > > > > >
> > > > > > > > Through bisection, we found this first-bad commit:
> > > > > > > >
> > > > > > > > tcp: fix to allow timestamp undo if no retransmits were sent
> > > > > > > > upstream: e37ab7373696e650d3b6262a5b882aadad69bb9e
> > > > > > > > stable 6.6.y: e676ca60ad2a6fdeb718b5e7a337a8fb1591d45f
>
> Hi Eric,
>
> Do you have cycles to test a proposed fix patch developed by our team?
>
> The attached patch should apply (with "git am") for any recent kernel
> that has the "tcp: fix to allow timestamp undo if no retransmits were
> sent" patch it is fixing. So you should be able to test it on top of
> the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> easier.
>
> If you have cycles to rerun your iperf test, with tcpdump, nstat, and
> ss instrumentation, that would be fantastic!
>
> The patch passes our internal packetdrill test suite, including new
> tests for this issue (based on the packetdrill scripts posted earlier
> in this thread).
>
> But it would be fantastic to directly confirm that this fixes your issue.
Hi Eric (Wheeler),
Just checking: would you be able to test that patch (from my previous
message) in your environment?
If not, given that the patch fixes our packetdrill reproducers, we can
send the patch to the list as-is without that testing.
Thanks,
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
From: Neal Cardwell @ 2025-06-13 21:02 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Thu, Jun 12, 2025 at 2:23 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> [...]
>
> Hi Eric (Wheeler),
>
> Just checking: would you be able to test that patch (from my previous
> message) in your environment?
>
> If not, given that the patch fixes our packetdrill reproducers, we can
> send the patch to the list as-is without that testing.
Update on this thread: I sent the patch to the netdev list:
[PATCH net] tcp: fix tcp_packet_delayed() for
tcp_is_non_sack_preventing_reopen() behavior
https://lore.kernel.org/netdev/20250613193056.1585351-1-ncardwell.sw@gmail.com/
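For anyone following along, a minimal sketch of fetching and applying
that patch by message-id (this assumes the "b4" tool is installed;
otherwise, save the raw message from lore and feed it to "git am"
directly):
  # fetch the patch from lore.kernel.org by message-id; b4 writes
  # the result out as an mbox file named after the series
  b4 am 20250613193056.1585351-1-ncardwell.sw@gmail.com
  # apply it on top of the tree under test
  git am ./*.mbx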
best,
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-10 17:15 ` Neal Cardwell
2025-06-12 18:23 ` Neal Cardwell
@ 2025-06-15 20:00 ` Eric Wheeler
2025-06-16 20:13 ` Eric Wheeler
1 sibling, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-15 20:00 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Tue, 10 Jun 2025, Neal Cardwell wrote:
> [...]
> Hi Eric,
>
> Do you have cycles to test a proposed fix patch developed by our team?
Sorry for the radio silence; I just got back into town, so I can do that
later this week.
> The attached patch should apply (with "git am") for any recent kernel
> that has the "tcp: fix to allow timestamp undo if no retransmits were
> sent" patch it is fixing. So you should be able to test it on top of
> the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> easier.
I can test on top of 6.6-stable, but I have to put a production system
into standby to do that, so I will report back when I can, possibly as
late as Friday 6/20, since the office is closed that day and I can work
on it then.
> If you have cycles to rerun your iperf test, with tcpdump, nstat, and
> ss instrumentation, that would be fantastic!
Will do.
> The patch passes our internal packetdrill test suite, including new
> tests for this issue (based on the packetdrill scripts posted earlier
> in this thread).
Awesome, thank you for all of the effort to fix this!
-Eric
>
> But it would be fantastic to directly confirm that this fixes your issue.
>
> Thanks!
> neal
>
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-15 20:00 ` Eric Wheeler
@ 2025-06-16 20:13 ` Eric Wheeler
2025-06-16 21:07 ` Neal Cardwell
0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-16 20:13 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Sun, 15 Jun 2025, Eric Wheeler wrote:
> [...]
> > The attached patch should apply (with "git am") for any recent kernel
> > that has the "tcp: fix to allow timestamp undo if no retransmits were
> > sent" patch it is fixing. So you should be able to test it on top of
> > the 6.6 stable or 6.15 stable kernels you used earlier. Whichever is
> > easier.
Definitely better, but performance is ~15% slower than with the revert,
and the retransmit counts are still higher. The two sections below show
the difference between the fix and the revert.
Here is the output:
## After fixing with your patch:
https://www.linuxglobal.com/out/for-neal/after-fix.tar.gz
WHEN=after-fix
(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/$WHEN-nstat.txt &
tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
iperf3 -c 192.168.1.203
kill %1 %2 %3
[1] 2300
nstat: history is aged out, resetting
[2] 2304
[3] 2305
Connecting to host 192.168.1.203, port 5201
[ 5] local 192.168.1.52 port 47730 connected to 192.168.1.203 port 5201
dropped privs to tcpdump
tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 115 MBytes 963 Mbits/sec 21 334 KBytes
[ 5] 1.00-2.00 sec 113 MBytes 949 Mbits/sec 3 325 KBytes
[ 5] 2.00-3.00 sec 41.8 MBytes 350 Mbits/sec 216 5.70 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 952 Mbits/sec 77 234 KBytes
[ 5] 4.00-5.00 sec 110 MBytes 927 Mbits/sec 5 281 KBytes
[ 5] 5.00-6.00 sec 69.5 MBytes 583 Mbits/sec 129 336 KBytes
[ 5] 6.00-7.00 sec 66.8 MBytes 561 Mbits/sec 234 302 KBytes
[ 5] 7.00-8.00 sec 113 MBytes 949 Mbits/sec 8 312 KBytes
[ 5] 8.00-9.00 sec 89.9 MBytes 754 Mbits/sec 72 247 KBytes
[ 5] 9.00-10.00 sec 113 MBytes 949 Mbits/sec 6 235 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 946 MBytes 794 Mbits/sec 771 sender <<<
[ 5] 0.00-10.04 sec 944 MBytes 789 Mbits/sec receiver <<<
iperf Done.
145337 packets captured
146674 packets received by filter
0 packets dropped by kernel
[1] Terminated ( while true; do
date +%s.%N; ss -tenmoi; sleep 0.050;
done ) > /tmp/$WHEN-ss.txt
[root@hv2 ~]#
[2]- Terminated ( while true; do
date +%s.%N; nstat; sleep 0.050;
done ) > /tmp/$WHEN-nstat.txt
[3]+ Done tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203
## After Revert
WHEN=after-revert-6.6.93
(while true; do date +%s.%N; ss -tenmoi; sleep 0.050; done) > /tmp/$WHEN-ss.txt &
nstat -n; (while true; do date +%s.%N; nstat; sleep 0.050; done) > /tmp/$WHEN-nstat.txt &
tcpdump -i br0 -w /tmp/$WHEN-tcpdump.${eth}.pcap -n -s 116 -c 1000000 host 192.168.1.203 &
iperf3 -c 192.168.1.203
kill %1 %2 %3
[1] 2088
nstat: history is aged out, resetting
[2] 2092
[3] 2093
Connecting to host 192.168.1.203, port 5201
dropped privs to tcpdump
tcpdump: listening on br0, link-type EN10MB (Ethernet), snapshot length 116 bytes
[ 5] local 192.168.1.52 port 47256 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 115 MBytes 962 Mbits/sec 13 324 KBytes
[ 5] 1.00-2.00 sec 114 MBytes 953 Mbits/sec 3 325 KBytes
[ 5] 2.00-3.00 sec 113 MBytes 947 Mbits/sec 4 321 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 950 Mbits/sec 3 321 KBytes
[ 5] 4.00-5.00 sec 113 MBytes 946 Mbits/sec 5 322 KBytes
[ 5] 5.00-6.00 sec 113 MBytes 950 Mbits/sec 8 321 KBytes
[ 5] 6.00-7.00 sec 113 MBytes 948 Mbits/sec 5 312 KBytes
[ 5] 7.00-8.00 sec 113 MBytes 952 Mbits/sec 3 301 KBytes
[ 5] 8.00-9.00 sec 113 MBytes 945 Mbits/sec 7 301 KBytes
[ 5] 9.00-10.00 sec 114 MBytes 953 Mbits/sec 4 302 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.11 GBytes 950 Mbits/sec 55 sender
[ 5] 0.00-10.04 sec 1.10 GBytes 945 Mbits/sec receiver
iperf Done.
[root@hv2 ~]# 189249 packets captured
189450 packets received by filter
0 packets dropped by kernel
--
Eric Wheeler
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-16 20:13 ` Eric Wheeler
@ 2025-06-16 21:07 ` Neal Cardwell
2025-06-18 22:03 ` Eric Wheeler
0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-16 21:07 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Mon, Jun 16, 2025 at 4:14 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> [...]
>
> Definitely better, but performance is ~15% slower than with the revert,
> and the retransmit counts are still higher.
>
> [...]
Thanks for the test data!
Looking at the traces, there are no undo events, and no spurious loss
recovery events that I can see. So I don't see how the fix patch,
which changes undo behavior, would be relevant to the performance in
the test. It looks to me like the "after-fix" test just got unlucky
with packet losses, and because the receiver does not have SACK
support, any bad luck can easily turn into very poor performance, with
200ms timeouts during fast recovery.
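One way to quantify that directly is the TcpExtTCPTimeouts counter; a
quick sketch, with counter names as exposed by recent kernels:
  # dump the RTO-timeout and retransmit counters after a run
  nstat -az TcpExtTCPTimeouts TcpRetransSegs
  # or roughly count how many 50ms sampling intervals saw a timeout
  # in the nstat logs collected above
  grep -c TcpExtTCPTimeouts /tmp/after-fix-nstat.txt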
Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
cases multiple times, so we can get a sense of what is signal and what
is noise? Perhaps 20 or 50 trials for each approach?
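Something like this minimal loop would do (a sketch; file names and the
pause between trials are arbitrary):
  WHEN=after-fix
  for i in $(seq 1 50); do
      # one 10-second iperf3 run per trial, saved as JSON for later analysis
      iperf3 -c 192.168.1.203 --json > /tmp/$WHEN-trial-$i.json
      sleep 5   # let queues drain between trials
  done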
Thanks!
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-16 21:07 ` Neal Cardwell
@ 2025-06-18 22:03 ` Eric Wheeler
2025-06-25 19:17 ` Eric Wheeler
0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-18 22:03 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Mon, 16 Jun 2025, Neal Cardwell wrote:
> [...]
>
> Thanks for the test data!
>
> Looking at the traces, there are no undo events, and no spurious loss
> recovery events that I can see. So I don't see how the fix patch,
> which changes undo behavior, would be relevant to the performance in
> the test. It looks to me like the "after-fix" test just got unlucky
> with packet losses, and because the receiver does not have SACK
> support, any bad luck can easily turn into very poor performance, with
> 200ms timeouts during fast recovery.
>
> Would you have cycles to run the "after-fix" and "after-revert-6.6.93"
> cases multiple times, so we can get a sense of what is signal and what
> is noise? Perhaps 20 or 50 trials for each approach?
I ran 50 tests after the revert and compared them to 50 tests after the
fix, using both the arithmetic and geometric means, and the fix still
appears to be slightly slower than the revert alone:
# after-revert-6.6.93
Arithmetic Mean: 843.64 Mbits/sec
Geometric Mean: 841.95 Mbits/sec
# after-tcp-fix-6.6.93
Arithmetic Mean: 823.00 Mbits/sec
Geometric Mean: 819.38 Mbits/sec
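For reference, a sketch of how both means can be recomputed, assuming
each trial was saved with iperf3 --json (the file glob is illustrative;
the field is iperf3's sender-side throughput):
  jq '.end.sum_sent.bits_per_second' /tmp/after-tcp-fix-trial-*.json |
  awk '{ sum += $1; logsum += log($1); n++ }
       END { printf "Arithmetic Mean: %.2f Mbits/sec\n", sum / n / 1e6;
             printf "Geometric Mean: %.2f Mbits/sec\n", exp(logsum / n) / 1e6 }'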
Do you think that this is an actual performance regression, or just a
sample set that is not big enough to work out the averages?
Here is the data collected for each of the 50 tests:
- https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
- https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
--
Eric Wheeler
> Thanks!
> neal
>
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-18 22:03 ` Eric Wheeler
@ 2025-06-25 19:17 ` Eric Wheeler
2025-06-25 20:19 ` Neal Cardwell
0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-25 19:17 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Wed, 18 Jun 2025, Eric Wheeler wrote:
> [...]
>
> I ran 50 tests after the revert and compared them to 50 tests after the
> fix, using both the arithmetic and geometric means, and the fix still
> appears to be slightly slower than the revert alone:
>
> # after-revert-6.6.93
> Arithmetic Mean: 843.64 Mbits/sec
> Geometric Mean: 841.95 Mbits/sec
>
> # after-tcp-fix-6.6.93
> Arithmetic Mean: 823.00 Mbits/sec
> Geometric Mean: 819.38 Mbits/sec
>
Re-sending this question in case the earlier message got lost:
> Do you think that this is an actual performance regression, or just a
> sample set that is not big enough to work out the averages?
>
> Here is the data collected for each of the 50 tests:
> - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
--
Eric Wheeler
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-25 19:17 ` Eric Wheeler
@ 2025-06-25 20:19 ` Neal Cardwell
2025-06-25 23:15 ` Eric Wheeler
0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-25 20:19 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Wed, Jun 25, 2025 at 3:17 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> [...]
> >
> > I ran 50 tests after the revert and compared them to 50 tests after the
> > fix, using both the arithmetic and geometric means, and the fix still
> > appears to be slightly slower than the revert alone:
> >
> > # after-revert-6.6.93
> > Arithmetic Mean: 843.64 Mbits/sec
> > Geometric Mean: 841.95 Mbits/sec
> >
> > # after-tcp-fix-6.6.93
> > Arithmetic Mean: 823.00 Mbits/sec
> > Geometric Mean: 819.38 Mbits/sec
> >
>
> Re-sending this question in case the earlier message got lost:
>
> > Do you think that this is an actual performance regression, or just a
> > sample set that is not big enough to work out the averages?
> >
> > Here is the data collected for each of the 50 tests:
> > - https://www.linuxglobal.com/out/for-neal/after-revert-6.6.93.tar.gz
> > - https://www.linuxglobal.com/out/for-neal/after-tcp-fix-6.6.93.tar.gz
Hi Eric,
Many thanks for this great data!
I have been looking at this data. It's quite interesting.
Looking at the CDF of throughputs for the "revert" cases vs the "fix"
cases (attached), it does look like, for the 70th percentile and below
(the 70% of most unlucky cases), the "fix" cases have lower throughput,
and IMHO this looks outside the realm of what we would expect from
noise.
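(For a rough percentile check, one throughput sample per trial can be
summarized like this; a sketch, with the input file name made up:)
  # 70th-percentile throughput, one Mbit/s value per line
  sort -n throughputs.txt |
  awk '{ v[NR] = $1 } END { print "p70:", v[int(NR * 0.7)] }'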
However, when I look at the traces, I don't see any reason why the
"fix" cases would be systematically slower. In particular, the "fix"
and "revert" cases are only changing a function used for "undo"
decisions, but in both the "fix" and "revert" cases, there are no
"undo" events, and I don't see cases with spurious retransmissions
where there should have been "undo" events and yet there were not.
Visually inspecting the traces, the dominant determinant of
performance seems to be how many RTO events there were. For example,
the worst case for the "fix" trials has 16 RTOs, whereas the worst
case for the "revert" trials has 13 RTOs. And the number of RTO events
per trial looks random; I see similar qualitative patterns between
"fix" and "revert" cases, and don't see any reason why there are more
RTOs in the "fix" cases than the "revert" cases. All the RTOs seem to
be due to pre-existing (longstanding) performance problems in non-SACK
loss recovery.
One way to proceed would be for me to offer some performance fixes for
the RTOs, so we can get rid of the RTOs, which are the biggest source
of performance variation. That should greatly reduce noise, and
perhaps make it easier to see if there is any real difference between
"fix" and "revert" cases.
We could compare the following two kernels, with another 50 tests for
each:
+ (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
+ (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
where:
"revert" = revert e37ab7373696 ("tcp: fix to allow timestamp undo if
no retransmits were sent")
"fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
tcp_is_non_sack_preventing_reopen() behavior"
This would have the side benefit of testing some performance
improvements for non-SACK connections.
Are you up for that? :-)
Best regards,
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-25 20:19 ` Neal Cardwell
@ 2025-06-25 23:15 ` Eric Wheeler
2025-06-26 14:21 ` Neal Cardwell
0 siblings, 1 reply; 19+ messages in thread
From: Eric Wheeler @ 2025-06-25 23:15 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Wed, 25 Jun 2025, Neal Cardwell wrote:
> [...]
> We could compare the following two kernels, with another 50 tests for
> each:
>
> + (a) 6.6.93 + {2 patches to fix RTOs} + "revert"
> + (b) 6.6.93 + {2 patches to fix RTOs} + "fix"
>
> where:
>
> "revert" = revert e37ab7373696 ("tcp: fix to allow timestamp undo if
> no retransmits were sent")
> "fix" = apply d0fa59897e04 ("tcp: fix tcp_packet_delayed() for
> tcp_is_non_sack_preventing_reopen() behavior"
>
> This would have the side benefit of testing some performance
> improvements for non-SACK connections.
>
> Are you up for that? :-)
Sure, if you have some patch ideas in mind, I'm all for getting patches
merged to improve performance.
BTW, what causes a non-SACK connection? The RX side is a near-idle Linux
6.8 host with default sysctl settings.
--
Eric Wheeler
>
> Best regards,
> neal
>
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-25 23:15 ` Eric Wheeler
@ 2025-06-26 14:21 ` Neal Cardwell
2025-06-26 20:16 ` Eric Wheeler
0 siblings, 1 reply; 19+ messages in thread
From: Neal Cardwell @ 2025-06-26 14:21 UTC (permalink / raw)
To: Eric Wheeler
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Wed, Jun 25, 2025 at 7:15 PM Eric Wheeler <netdev@lists.ewheeler.net> wrote:
>
> [...]
> Sure, if you have some patch ideas in mind, I'm all for getting patches
> merged to improve performance.
Great! Thanks for being willing to do this! I will try to post some
patches ASAP.
> BTW, what causes a non-SACK connection? The RX side is a near-idle Linux
> 6.8 host with default sysctl settings.
Given that the RX side is a Linux 6.8 host, the kernel should support
TCP SACK due to kernel compile-time defaults (see the
"net->ipv4.sysctl_tcp_sack = 1;" in net/ipv4/tcp_ipv4.c).
Given that factor, off-hand, I can think of only a few reasons why the
RX side would not negotiate SACK support:
(1) Some script or software on the RX machine has disabled SACK via
"sysctl net.ipv4.tcp_sack=0" or equivalent, perhaps at boot time (this
is easy to check with "sysctl net.ipv4.tcp_sack").
(2) There is a middlebox on the path (doing firewalling or NAT, etc.)
that disables SACK.
(3) There is a firewall rule on some machine or router/switch that
disables SACK.
Off-hand, I would think that (2) is the most likely case, since
intentionally disabling SACK via sysctl or firewall rule is
inadvisable and rare.
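A quick way to tell these apart is to check the sysctl on both hosts and
then watch the SYN/SYN-ACK options on the wire; "sackOK" should appear
in both directions if SACK is negotiated (a sketch, reusing the br0
interface from the earlier captures):
  sysctl net.ipv4.tcp_sack   # run on both sender and receiver
  # capture the first two SYN-flagged packets of a fresh connection,
  # verbose enough to print the TCP options (look for "sackOK")
  tcpdump -ni br0 -v -c 2 'tcp[tcpflags] & tcp-syn != 0'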
Any thoughts on which of these might be in play here?
Thanks,
neal
* Re: [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent
2025-06-26 14:21 ` Neal Cardwell
@ 2025-06-26 20:16 ` Eric Wheeler
0 siblings, 0 replies; 19+ messages in thread
From: Eric Wheeler @ 2025-06-26 20:16 UTC (permalink / raw)
To: Neal Cardwell
Cc: netdev, Eric Dumazet, Geumhwan Yu, Jakub Kicinski, Sasha Levin,
Yuchung Cheng, stable
On Thu, 26 Jun 2025, Neal Cardwell wrote:
> [...]
> Given that factor, off-hand, I can think of only a few reasons why the
> RX side would not negotiate SACK support:
>
> (1) Some script or software on the RX machine has disabled SACK via
> "sysctl net.ipv4.tcp_sack=0" or equivalent, perhaps at boot time (this
> is easy to check with "sysctl net.ipv4.tcp_sack").
It looks like you are right:
# cat /proc/sys/net/ipv4/tcp_sack
0
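(For the record, a sketch of turning it back on; the persist path
varies by distro:)
  sysctl -w net.ipv4.tcp_sack=1   # takes effect immediately
  # persist across reboots
  echo 'net.ipv4.tcp_sack = 1' > /etc/sysctl.d/99-tcp-sack.conf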
and it runs way faster after turning it on:
~]# iperf3 -c 192.168.1.203
Connecting to host 192.168.1.203, port 5201
[ 5] local 192.168.1.52 port 55104 connected to 192.168.1.203 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 115 MBytes 964 Mbits/sec 27 234 KBytes
[ 5] 1.00-2.00 sec 113 MBytes 949 Mbits/sec 7 242 KBytes
[ 5] 2.00-3.00 sec 113 MBytes 950 Mbits/sec 7 247 KBytes
[ 5] 3.00-4.00 sec 113 MBytes 947 Mbits/sec 8 261 KBytes
[ 5] 4.00-5.00 sec 114 MBytes 953 Mbits/sec 11 261 KBytes
[ 5] 5.00-6.00 sec 113 MBytes 948 Mbits/sec 9 261 KBytes
[ 5] 6.00-7.00 sec 113 MBytes 950 Mbits/sec 5 261 KBytes
[ 5] 7.00-8.00 sec 113 MBytes 947 Mbits/sec 10 272 KBytes
[ 5] 8.00-9.00 sec 113 MBytes 950 Mbits/sec 5 274 KBytes
[ 5] 9.00-10.00 sec 113 MBytes 947 Mbits/sec 6 275 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate Retr
[ 5] 0.00-10.00 sec 1.11 GBytes 950 Mbits/sec 95 sender
[ 5] 0.00-10.04 sec 1.11 GBytes 945 Mbits/sec receiver
Do you want to continue troubleshooting non-SACK performance, since I
have a reliable way to reproduce the issue, or leave it here with "I
should have had SACK enabled"?
-Eric
Thread overview: 19+ messages
2025-06-06 1:32 [BISECT] regression: tcp: fix to allow timestamp undo if no retransmits were sent Eric Wheeler
2025-06-06 17:16 ` Neal Cardwell
2025-06-06 22:34 ` Eric Wheeler
2025-06-07 19:13 ` Neal Cardwell
2025-06-07 22:54 ` Neal Cardwell
2025-06-07 23:26 ` Neal Cardwell
2025-06-09 17:45 ` Neal Cardwell
2025-06-10 17:15 ` Neal Cardwell
2025-06-12 18:23 ` Neal Cardwell
2025-06-13 21:02 ` Neal Cardwell
2025-06-15 20:00 ` Eric Wheeler
2025-06-16 20:13 ` Eric Wheeler
2025-06-16 21:07 ` Neal Cardwell
2025-06-18 22:03 ` Eric Wheeler
2025-06-25 19:17 ` Eric Wheeler
2025-06-25 20:19 ` Neal Cardwell
2025-06-25 23:15 ` Eric Wheeler
2025-06-26 14:21 ` Neal Cardwell
2025-06-26 20:16 ` Eric Wheeler