* Re: iperf performance regression since Linux 5.18
2023-10-14 19:40 ` Neal Cardwell
@ 2023-10-14 21:16 ` Dave Taht
2023-10-14 22:51 ` Eric Dumazet
2023-10-15 0:06 ` Stefan Wahren
2 siblings, 0 replies; 20+ messages in thread
From: Dave Taht @ 2023-10-14 21:16 UTC (permalink / raw)
To: Neal Cardwell
Cc: Stefan Wahren, Eric Dumazet, Jakub Kicinski, Fabio Estevam,
linux-imx, Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
I am also kind of a huge believer in packet captures...
On Sat, Oct 14, 2023 at 12:41 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
> >
> > Hi,
> >
> > Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> > >>> Hi,
> > >>> we recently switched our ARM NXP i.MX6ULL based embedded device
> > >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> > >>> that we noticed a measurable performance regression on the Ethernet
> > >>> interface (driver: fec, 100 Mbit link) while running iperf client on the
> > >>> device:
> > >>>
> > >>> BAD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> > >>>
> > >>> GOOD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> > >>>
> > >>> We were able to bisect this down to this commit:
> > >>>
> > >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> > >>> TSO packet sizes based on min_rtt
> > >>>
> > >>> Disabling this new setting via:
> > >>>
> > >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> > >>>
> > >>> confirms that this was the cause of the performance regression.
> > >>>
> > >>> Is it expected that the new default setting has such a performance impact?
> > > Indeed, thanks for the report.
> > >
> > > In addition to the "ss" output Eric mentioned, could you please grab
> > > "nstat" output, which should allow us to calculate the average TSO/GSO
> > > and LRO/GRO burst sizes, which is the key thing tuned with the
> > > tcp_tso_rtt_log knob.
> > >
> > > So it would be great to have the following from both data sender and
> > > data receiver, for both the good case and bad case, if you could start
> > > these before your test and kill them after the test stops:
> > >
> > > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> > I uploaded everything here:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress
> >
> > The server side is an Ubuntu installation connected to the internet. At
> > first I logged the good case, then continued with the bad case. I
> > accidentally deleted a log file of the bad case, so I repeated the whole
> > bad case again. The uploaded bad-case files are therefore from the third run.
>
> Thanks for the detailed data!
>
> Here are some notes from looking at this data:
>
> + bad client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> IpOutRequests 308 44.7
> IpExtOutOctets 10050656 1403181.0
> est bytes per TSO burst: 10050656 / 308 = 32632
> est packets per TSO burst: 32632 / 1448 ~= 22.5
>
> + good client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> IpOutRequests 529 62.0
> IpExtOutOctets 11502992 1288711.5
> est bytes per TSO burst: 11502992 / 529 ~= 21745
> est packets per TSO burst: 21745 / 1448 ~= 15.0
>
> + bad client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>
> + good client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> Fri Oct 13 15:04:36 CEST 2023
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
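For reference, the burst-size and cwnd arithmetic quoted above is easy to re-run; a minimal sketch in Python, with the counter values copied from the nstat/ss logs cited above:

```python
# Reproduce the per-burst and cwnd arithmetic from the quoted analysis.
MSS = 1448  # payload bytes per segment, as in the logs above

def tso_burst(ip_out_requests, ip_ext_out_octets):
    """Estimate bytes and packets per TSO burst from nstat counter deltas."""
    bytes_per_burst = ip_ext_out_octets / ip_out_requests
    return bytes_per_burst, bytes_per_burst / MSS

def cwnd_rate_mbps(cwnd_pkts, rtt_sec):
    """Throughput that cwnd alone would allow, in Mbit/s."""
    return cwnd_pkts * MSS * 8 / rtt_sec / 1e6

print(tso_burst(308, 10050656))     # bad case:  ≈ (32632, 22.5)
print(tso_burst(529, 11502992))     # good case: ≈ (21745, 15.0)
print(cwnd_rate_mbps(68, 0.0018))   # bad case:  ≈ 437.6
print(cwnd_rate_mbps(106, 0.0028))  # good case: ≈ 438.5
```

Both cwnd-allowed rates land near 438 Mbit/s, far above the 100 Mbit link, which is why cwnd is ruled out as the limiter.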
>
> So it seems indeed like cwnd is not the limiting factor, and instead
> there is something about the larger TSO/GSO bursts (roughly 22.5
> packets per burst on average) in the "bad" case that is causing
> problems, and preventing the sender from keeping the pipe fully
> utilized.
>
> So perhaps the details of the tcp_tso_should_defer() logic are hurting
> performance?
>
> The default value of tcp_tso_win_divisor is 3, and in the bad case the
> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> suspiciously close to the average TSO burst size of 22.5. So my guess
> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> TSO burst size in the "good" case, and fully utilize the pipe. So it
> seems worth an experiment, to see what we can learn.
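The divisor arithmetic above, spelled out (a sketch; tcp_tso_should_defer() caps a burst at roughly cwnd / tcp_tso_win_divisor, while the real code also weighs the send window, pacing, and other limits):

```python
def max_tso_burst(cwnd_pkts, tso_win_divisor):
    """Rough cap tcp_tso_should_defer() places on one burst, in packets."""
    return cwnd_pkts / tso_win_divisor

print(max_tso_burst(68, 3))  # ≈ 22.7 pkts, suspiciously close to the observed 22.5
print(max_tso_burst(68, 5))  # = 13.6 pkts, near the good case's ~15.0
```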
>
> To test that theory, could you please try running the following as
> root on the data sender machine, and then re-running the "bad" test
> with tcp_tso_rtt_log at the default value of 9?
>
> sysctl net.ipv4.tcp_tso_win_divisor=5
>
> Thanks!
> neal
>
--
Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
Dave Täht CSO, LibreQos
* Re: iperf performance regression since Linux 5.18
2023-10-14 19:40 ` Neal Cardwell
2023-10-14 21:16 ` Dave Taht
@ 2023-10-14 22:51 ` Eric Dumazet
2023-10-14 23:24 ` Neal Cardwell
2023-10-14 23:26 ` Stefan Wahren
2023-10-15 0:06 ` Stefan Wahren
2 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2023-10-14 22:51 UTC (permalink / raw)
To: Neal Cardwell
Cc: Stefan Wahren, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
> >
> > Hi,
> >
> > Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> > >>> Hi,
> > >>> we recently switched our ARM NXP i.MX6ULL based embedded device
> > >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> > >>> that we noticed a measurable performance regression on the Ethernet
> > >>> interface (driver: fec, 100 Mbit link) while running iperf client on the
> > >>> device:
> > >>>
> > >>> BAD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> > >>>
> > >>> GOOD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> > >>>
> > >>> We were able to bisect this down to this commit:
> > >>>
> > >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> > >>> TSO packet sizes based on min_rtt
> > >>>
> > >>> Disabling this new setting via:
> > >>>
> > >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> > >>>
> > >>> confirms that this was the cause of the performance regression.
> > >>>
> > >>> Is it expected that the new default setting has such a performance impact?
> > > Indeed, thanks for the report.
> > >
> > > In addition to the "ss" output Eric mentioned, could you please grab
> > > "nstat" output, which should allow us to calculate the average TSO/GSO
> > > and LRO/GRO burst sizes, which is the key thing tuned with the
> > > tcp_tso_rtt_log knob.
> > >
> > > So it would be great to have the following from both data sender and
> > > data receiver, for both the good case and bad case, if you could start
> > > these before your test and kill them after the test stops:
> > >
> > > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> > I uploaded everything here:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress
> >
> > The server side is an Ubuntu installation connected to the internet. At
> > first I logged the good case, then continued with the bad case. I
> > accidentally deleted a log file of the bad case, so I repeated the whole
> > bad case again. The uploaded bad-case files are therefore from the third run.
>
> Thanks for the detailed data!
>
> Here are some notes from looking at this data:
>
> + bad client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> IpOutRequests 308 44.7
> IpExtOutOctets 10050656 1403181.0
> est bytes per TSO burst: 10050656 / 308 = 32632
> est packets per TSO burst: 32632 / 1448 ~= 22.5
>
> + good client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> IpOutRequests 529 62.0
> IpExtOutOctets 11502992 1288711.5
> est bytes per TSO burst: 11502992 / 529 ~= 21745
> est packets per TSO burst: 21745 / 1448 ~= 15.0
>
> + bad client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>
> + good client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> Fri Oct 13 15:04:36 CEST 2023
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
>
> So it seems indeed like cwnd is not the limiting factor, and instead
> there is something about the larger TSO/GSO bursts (roughly 22.5
> packets per burst on average) in the "bad" case that is causing
> problems, and preventing the sender from keeping the pipe fully
> utilized.
>
> So perhaps the details of the tcp_tso_should_defer() logic are hurting
> performance?
>
> The default value of tcp_tso_win_divisor is 3, and in the bad case the
> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> suspiciously close to the average TSO burst size of 22.5. So my guess
> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> TSO burst size in the "good" case, and fully utilize the pipe. So it
> seems worth an experiment, to see what we can learn.
>
> To test that theory, could you please try running the following as
> root on the data sender machine, and then re-running the "bad" test
> with tcp_tso_rtt_log at the default value of 9?
>
> sysctl net.ipv4.tcp_tso_win_divisor=5
>
> Thanks!
> neal
Hmm, we receive ~3200 ACKs per second, so I am not sure the
tcp_tso_should_defer() logic would hurt?
Also, the ss binary on the client seems very old, or perhaps its output
has been mangled?
State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
ESTAB 0 492320 192.168.1.12:33284 192.168.1.129:5001
timer:(on,030ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
reordering:0
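One way to sanity-check that ~3200 figure: at the bad case's ~80.9 Mbit/s with a 1448-byte MSS, a receiver acking every other full-sized segment (classic delayed-ACK behavior; an assumption here) would generate ACKs at the same order of magnitude:

```python
def est_acks_per_sec(throughput_mbps, mss=1448, segs_per_ack=2):
    """Rough delayed-ACK estimate: one ACK per `segs_per_ack` full segments."""
    return throughput_mbps * 1e6 / 8 / mss / segs_per_ack

print(est_acks_per_sec(80.9))  # ≈ 3492, same ballpark as the observed ~3200
```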
* Re: iperf performance regression since Linux 5.18
2023-10-14 22:51 ` Eric Dumazet
@ 2023-10-14 23:24 ` Neal Cardwell
2023-10-14 23:26 ` Stefan Wahren
1 sibling, 0 replies; 20+ messages in thread
From: Neal Cardwell @ 2023-10-14 23:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stefan Wahren, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Sat, Oct 14, 2023 at 6:51 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
> > >
> > > Hi,
> > >
> > > Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > > > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> > > >>> Hi,
> > > >>> we recently switched our ARM NXP i.MX6ULL based embedded device
> > > >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> > > >>> that we noticed a measurable performance regression on the Ethernet
> > > >>> interface (driver: fec, 100 Mbit link) while running iperf client on the
> > > >>> device:
> > > >>>
> > > >>> BAD
> > > >>>
> > > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > > >>> ------------------------------------------------------------
> > > >>> Client connecting to 192.168.1.129, TCP port 5001
> > > >>> TCP window size: 96.2 KByte (default)
> > > >>> ------------------------------------------------------------
> > > >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> > > >>> [ ID] Interval Transfer Bandwidth
> > > >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> > > >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> > > >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> > > >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> > > >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> > > >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> > > >>>
> > > >>> GOOD
> > > >>>
> > > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > > >>> ------------------------------------------------------------
> > > >>> Client connecting to 192.168.1.129, TCP port 5001
> > > >>> TCP window size: 96.2 KByte (default)
> > > >>> ------------------------------------------------------------
> > > >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> > > >>> [ ID] Interval Transfer Bandwidth
> > > >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> > > >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> > > >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> > > >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> > > >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> > > >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> > > >>>
> > > >>> We were able to bisect this down to this commit:
> > > >>>
> > > >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> > > >>> TSO packet sizes based on min_rtt
> > > >>>
> > > >>> Disabling this new setting via:
> > > >>>
> > > >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> > > >>>
> > > >>> confirms that this was the cause of the performance regression.
> > > >>>
> > > >>> Is it expected that the new default setting has such a performance impact?
> > > > Indeed, thanks for the report.
> > > >
> > > > In addition to the "ss" output Eric mentioned, could you please grab
> > > > "nstat" output, which should allow us to calculate the average TSO/GSO
> > > > and LRO/GRO burst sizes, which is the key thing tuned with the
> > > > tcp_tso_rtt_log knob.
> > > >
> > > > So it would be great to have the following from both data sender and
> > > > data receiver, for both the good case and bad case, if you could start
> > > > these before your test and kill them after the test stops:
> > > >
> > > > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > > > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> > > I uploaded everything here:
> > > https://github.com/lategoodbye/tcp_tso_rtt_log_regress
> > >
> > > The server side is an Ubuntu installation connected to the internet. At
> > > first I logged the good case, then continued with the bad case. I
> > > accidentally deleted a log file of the bad case, so I repeated the whole
> > > bad case again. The uploaded bad-case files are therefore from the third run.
> >
> > Thanks for the detailed data!
> >
> > Here are some notes from looking at this data:
> >
> > + bad client: avg TSO burst size is roughly:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> > IpOutRequests 308 44.7
> > IpExtOutOctets 10050656 1403181.0
> > est bytes per TSO burst: 10050656 / 308 = 32632
> > est packets per TSO burst: 32632 / 1448 ~= 22.5
> >
> > + good client: avg TSO burst size is roughly:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> > IpOutRequests 529 62.0
> > IpExtOutOctets 11502992 1288711.5
> > est bytes per TSO burst: 11502992 / 529 ~= 21745
> > est packets per TSO burst: 21745 / 1448 ~= 15.0
> >
> > + bad client ss data:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> > State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> > ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> > timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> > reordering:0
> > Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
> >
> > + good client ss data:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> > Fri Oct 13 15:04:36 CEST 2023
> > State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> > ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> > timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> > reordering:0
> > Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
> >
> > So it seems indeed like cwnd is not the limiting factor, and instead
> > there is something about the larger TSO/GSO bursts (roughly 22.5
> > packets per burst on average) in the "bad" case that is causing
> > problems, and preventing the sender from keeping the pipe fully
> > utilized.
> >
> > So perhaps the details of the tcp_tso_should_defer() logic are hurting
> > performance?
> >
> > The default value of tcp_tso_win_divisor is 3, and in the bad case the
> > cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> > suspiciously close to the average TSO burst size of 22.5. So my guess
> > is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> > perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> > TSO burst size in the "good" case, and fully utilize the pipe. So it
> > seems worth an experiment, to see what we can learn.
> >
> > To test that theory, could you please try running the following as
> > root on the data sender machine, and then re-running the "bad" test
> > with tcp_tso_rtt_log at the default value of 9?
> >
> > sysctl net.ipv4.tcp_tso_win_divisor=5
> >
> > Thanks!
> > neal
>
> Hmm, we receive ~3200 ACKs per second, so I am not sure the
> tcp_tso_should_defer() logic would hurt?
>
> Also, the ss binary on the client seems very old, or perhaps its output
> has been mangled?
>
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 492320 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,030ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
Yes, agreed, it would be super-useful if the next run could have both
a packet capture and newer/fuller ss output; perhaps something like:
(a) to run tcpdump on the data sender:
tcpdump -w /root/dump.pcap -n -s 116 -c 1000000 host $REMOTE_HOST -i $INTERFACE &
(b) to build a newer ss:
git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2.git
cd iproute2/
./configure
make
Thanks!
neal
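Once such logs exist, the cwnd trajectory can be pulled out of the periodic ss samples with a few lines; a sketch (the regex assumes the `cwnd:<n>` token format shown in the ss output quoted above):

```python
import re

def cwnd_series(ss_log_text):
    """Return cwnd samples, in order, from periodic `ss -i` output."""
    return [int(m.group(1)) for m in re.finditer(r"\bcwnd:(\d+)", ss_log_text)]

# Two sample lines in the same shape as the logs quoted in this thread.
sample = ("ESTAB 0 492320 192.168.1.12:33284 192.168.1.129:5001\n"
          "timer:(on,030ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20\n")
print(cwnd_series(sample))  # → [106]
```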
* Re: iperf performance regression since Linux 5.18
2023-10-14 22:51 ` Eric Dumazet
2023-10-14 23:24 ` Neal Cardwell
@ 2023-10-14 23:26 ` Stefan Wahren
2023-10-15 10:23 ` Stefan Wahren
1 sibling, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-14 23:26 UTC (permalink / raw)
To: Eric Dumazet, Neal Cardwell
Cc: Jakub Kicinski, Fabio Estevam, linux-imx, Stefan Wahren,
Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 15.10.23 um 00:51 schrieb Eric Dumazet:
> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com> wrote:
>> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>>> Hi,
>>>
>>> Am 09.10.23 um 21:19 schrieb Neal Cardwell:
>>>> On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>>>>> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>>>>>> Hi,
>>>>>> we recently switched our ARM NXP i.MX6ULL based embedded device
>>>>>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
>>>>>> that we noticed a measurable performance regression on the Ethernet
>>>>>> interface (driver: fec, 100 Mbit link) while running iperf client on the
>>>>>> device:
>>>>>>
>>>>>> BAD
>>>>>>
>>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>>> ------------------------------------------------------------
>>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>>> TCP window size: 96.2 KByte (default)
>>>>>> ------------------------------------------------------------
>>>>>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
>>>>>> [ ID] Interval Transfer Bandwidth
>>>>>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
>>>>>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
>>>>>>
>>>>>> GOOD
>>>>>>
>>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>>> ------------------------------------------------------------
>>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>>> TCP window size: 96.2 KByte (default)
>>>>>> ------------------------------------------------------------
>>>>>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
>>>>>> [ ID] Interval Transfer Bandwidth
>>>>>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>>>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
>>>>>>
>>>>>> We were able to bisect this down to this commit:
>>>>>>
>>>>>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
>>>>>> TSO packet sizes based on min_rtt
>>>>>>
>>>>>> Disabling this new setting via:
>>>>>>
>>>>>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
>>>>>>
>>>>>> confirms that this was the cause of the performance regression.
>>>>>>
>>>>>> Is it expected that the new default setting has such a performance impact?
>>>> Indeed, thanks for the report.
>>>>
>>>> In addition to the "ss" output Eric mentioned, could you please grab
>>>> "nstat" output, which should allow us to calculate the average TSO/GSO
>>>> and LRO/GRO burst sizes, which is the key thing tuned with the
>>>> tcp_tso_rtt_log knob.
>>>>
>>>> So it would be great to have the following from both data sender and
>>>> data receiver, for both the good case and bad case, if you could start
>>>> these before your test and kill them after the test stops:
>>>>
>>>> (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
>>>> nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
>>> I uploaded everything here:
>>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress
>>>
>>> The server side is an Ubuntu installation connected to the internet. At
>>> first I logged the good case, then continued with the bad case. I
>>> accidentally deleted a log file of the bad case, so I repeated the whole
>>> bad case again. The uploaded bad-case files are therefore from the third run.
>> Thanks for the detailed data!
>>
>> Here are some notes from looking at this data:
>>
>> + bad client: avg TSO burst size is roughly:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
>> IpOutRequests 308 44.7
>> IpExtOutOctets 10050656 1403181.0
>> est bytes per TSO burst: 10050656 / 308 = 32632
>> est packets per TSO burst: 32632 / 1448 ~= 22.5
>>
>> + good client: avg TSO burst size is roughly:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
>> IpOutRequests 529 62.0
>> IpExtOutOctets 11502992 1288711.5
>> est bytes per TSO burst: 11502992 / 529 ~= 21745
>> est packets per TSO burst: 21745 / 1448 ~= 15.0
>>
>> + bad client ss data:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
>> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
>> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
>> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
>> reordering:0
>> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>>
>> + good client ss data:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
>> Fri Oct 13 15:04:36 CEST 2023
>> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
>> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
>> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
>> reordering:0
>> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
>>
>> So it seems indeed like cwnd is not the limiting factor, and instead
>> there is something about the larger TSO/GSO bursts (roughly 22.5
>> packets per burst on average) in the "bad" case that is causing
>> problems, and preventing the sender from keeping the pipe fully
>> utilized.
>>
>> So perhaps the details of the tcp_tso_should_defer() logic are hurting
>> performance?
>>
>> The default value of tcp_tso_win_divisor is 3, and in the bad case the
>> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
>> suspiciously close to the average TSO burst size of 22.5. So my guess
>> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
>> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
>> TSO burst size in the "good" case, and fully utilize the pipe. So it
>> seems worth an experiment, to see what we can learn.
>>
>> To test that theory, could you please try running the following as
>> root on the data sender machine, and then re-running the "bad" test
>> with tcp_tso_rtt_log at the default value of 9?
>>
>> sysctl net.ipv4.tcp_tso_win_divisor=5
>>
>> Thanks!
>> neal
> Hmm, we receive ~3200 ACKs per second, so I am not sure the
> tcp_tso_should_defer() logic would hurt?
>
> Also, the ss binary on the client seems very old, or perhaps its output
> has been mangled?
This binary is from Yocto kirkstone:
# ss --version
ss utility, iproute2-5.17.0
This shouldn't be too old. Maybe some missing kernel settings?
A possibly relevant extract from the config:
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
# CONFIG_PROC_CHILDREN is not set
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_TMPFS_XATTR is not set
CONFIG_MEMFD_CREATE=y
CONFIG_CONFIGFS_FS=y
# end of Pseudo filesystems
CONFIG_NET=y
CONFIG_NET_EGRESS=y
#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
# CONFIG_XFRM_USER is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_NET_IPVTI is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_TABLE_PERTURB_ORDER=16
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
# CONFIG_IPV6_ROUTER_PREF is not set
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_IPV6_ILA is not set
# CONFIG_IPV6_VTI is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
# CONFIG_IPV6_RPL_LWTUNNEL is not set
# CONFIG_IPV6_IOAM6_LWTUNNEL is not set
# CONFIG_MPTCP is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
# CONFIG_BRIDGE_NETFILTER is not set
#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_INGRESS is not set
CONFIG_NETFILTER_EGRESS=y
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NETFILTER_NETLINK_OSF is not set
CONFIG_NF_CONNTRACK=y
# CONFIG_NF_LOG_SYSLOG is not set
# CONFIG_NF_CONNTRACK_MARK is not set
# CONFIG_NF_CONNTRACK_ZONES is not set
# CONFIG_NF_CONNTRACK_PROCFS is not set
# CONFIG_NF_CONNTRACK_EVENTS is not set
# CONFIG_NF_CONNTRACK_TIMEOUT is not set
# CONFIG_NF_CONNTRACK_TIMESTAMP is not set
# CONFIG_NF_CONNTRACK_LABELS is not set
CONFIG_NF_CT_PROTO_DCCP=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
# CONFIG_NF_CONNTRACK_AMANDA is not set
CONFIG_NF_CONNTRACK_FTP=y
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_SNMP is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
CONFIG_NF_NAT=y
CONFIG_NF_NAT_FTP=y
CONFIG_NF_NAT_MASQUERADE=y
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y
#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set
# CONFIG_NETFILTER_XT_CONNMARK is not set
#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
CONFIG_NETFILTER_XT_NAT=y
# CONFIG_NETFILTER_XT_TARGET_NETMAP is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_REDIRECT is not set
CONFIG_NETFILTER_XT_TARGET_MASQUERADE=y
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y
#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_CLUSTER is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLABEL is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
# CONFIG_NETFILTER_XT_MATCH_CONNTRACK is not set
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OSF is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_SOCKET is not set
# CONFIG_NETFILTER_XT_MATCH_STATE is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# end of Core Netfilter Configuration
# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set
#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
# CONFIG_NF_SOCKET_IPV4 is not set
# CONFIG_NF_TPROXY_IPV4 is not set
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
# CONFIG_NF_REJECT_IPV4 is not set
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_TTL is not set
# CONFIG_IP_NF_FILTER is not set
# CONFIG_IP_NF_TARGET_SYNPROXY is not set
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_REDIRECT is not set
# CONFIG_IP_NF_MANGLE is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_ARPTABLES is not set
# end of IP: Netfilter Configuration
#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_SOCKET_IPV6 is not set
# CONFIG_NF_TPROXY_IPV6 is not set
# CONFIG_NF_DUP_IPV6 is not set
# CONFIG_NF_REJECT_IPV6 is not set
# CONFIG_NF_LOG_IPV6 is not set
# CONFIG_IP6_NF_IPTABLES is not set
# end of IPv6: Netfilter Configuration
CONFIG_NF_DEFRAG_IPV6=y
# CONFIG_NF_CONNTRACK_BRIDGE is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_BRIDGE_VLAN_FILTERING=y
# CONFIG_BRIDGE_MRP is not set
# CONFIG_BRIDGE_CFM is not set
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=y
# CONFIG_VLAN_8021Q_GVRP is not set
# CONFIG_VLAN_8021Q_MVRP is not set
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# end of Network testing
# end of Networking options
# CONFIG_HAMRADIO is not set
CONFIG_CAN=y
CONFIG_CAN_RAW=y
CONFIG_CAN_BCM=y
# CONFIG_CAN_GW is not set
# CONFIG_CAN_J1939 is not set
# CONFIG_CAN_ISOTP is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
# CONFIG_MCTP is not set
CONFIG_WIRELESS=y
CONFIG_CFG80211=y
# CONFIG_NL80211_TESTMODE is not set
# CONFIG_CFG80211_DEVELOPER_WARNINGS is not set
# CONFIG_CFG80211_CERTIFICATION_ONUS is not set
CONFIG_CFG80211_REQUIRE_SIGNED_REGDB=y
CONFIG_CFG80211_USE_KERNEL_REGDB_KEYS=y
CONFIG_CFG80211_DEFAULT_PS=y
# CONFIG_CFG80211_DEBUGFS is not set
CONFIG_CFG80211_CRDA_SUPPORT=y
# CONFIG_CFG80211_WEXT is not set
CONFIG_MAC80211=y
CONFIG_MAC80211_HAS_RC=y
CONFIG_MAC80211_RC_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel_ht"
# CONFIG_MAC80211_MESH is not set
CONFIG_MAC80211_LEDS=y
# CONFIG_MAC80211_DEBUGFS is not set
# CONFIG_MAC80211_MESSAGE_TRACING is not set
# CONFIG_MAC80211_DEBUG_MENU is not set
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_NET_SELFTESTS=y
CONFIG_PAGE_POOL=y
# CONFIG_PAGE_POOL_STATS is not set
# CONFIG_FAILOVER is not set
CONFIG_ETHTOOL_NETLINK=y
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-14 23:26 ` Stefan Wahren
@ 2023-10-15 10:23 ` Stefan Wahren
2023-10-16 9:49 ` Eric Dumazet
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-15 10:23 UTC (permalink / raw)
To: Eric Dumazet, Neal Cardwell
Cc: Jakub Kicinski, Fabio Estevam, linux-imx, Stefan Wahren,
Michael Heimpold, netdev, Yuchung Cheng
Hi,
Am 15.10.23 um 01:26 schrieb Stefan Wahren:
> Hi Eric,
>
> Am 15.10.23 um 00:51 schrieb Eric Dumazet:
>> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
>> wrote:
...
>> Hmm, we receive ~3200 acks per second, I am not sure the
>> tcp_tso_should_defer() logic
>> would hurt ?
>>
>> Also the ss binary on the client seems very old, or its output has
>> been mangled perhaps ?
> this binary is from Yocto kirkstone:
>
> # ss --version
> ss utility, iproute2-5.17.0
>
> This shouldn't be too old. Maybe some missing kernel settings?
>
I think I was able to fix the issue by enabling the proper kernel
settings. I reran the initial bad and good cases and overwrote the log
files:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-15 10:23 ` Stefan Wahren
@ 2023-10-16 9:49 ` Eric Dumazet
2023-10-16 10:35 ` Eric Dumazet
2023-10-16 18:21 ` Stefan Wahren
0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2023-10-16 9:49 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Sun, Oct 15, 2023 at 12:23 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi,
>
> Am 15.10.23 um 01:26 schrieb Stefan Wahren:
> > Hi Eric,
> >
> > Am 15.10.23 um 00:51 schrieb Eric Dumazet:
> >> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
> >> wrote:
> ...
> >> Hmm, we receive ~3200 acks per second, I am not sure the
> >> tcp_tso_should_defer() logic
> >> would hurt ?
> >>
> >> Also the ss binary on the client seems very old, or its output has
> >> been mangled perhaps ?
> > this binary is from Yocto kirkstone:
> >
> > # ss --version
> > ss utility, iproute2-5.17.0
> >
> > This shouldn't be too old. Maybe some missing kernel settings?
> >
> I think I was able to fix the issue by enabling the proper kernel
> settings. I reran the initial bad and good cases and overwrote the log
> files:
>
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
Excellent, thanks.
I see your kernel uses HZ=100, have you tried HZ=1000 by any chance ?
CONFIG_HZ_1000=y
CONFIG_HZ=1000
I see that the bad run seems to be stuck for a while with cwnd=66, but
a smaller number of packets in flight (26 in the following ss extract)
ESTAB 0 315664 192.168.1.12:60542 192.168.1.129:5001
timer:(on,030ms,0) ino:13011 sk:2 <->
skmem:(r0,rb131072,t48488,tb295680,f3696,w319888,o0,bl0,d0) ts sack
cubic wscale:7,6 rto:210 rtt:3.418/1.117 mss:1448 pmtu:1500 rcvmss:536
advmss:1448 cwnd:66 ssthresh:20 bytes_sent:43874400
bytes_acked:43836753 segs_out:30302 segs_in:14110 data_segs_out:30300
send 223681685bps lastsnd:10 lastrcv:4310 pacing_rate 268408200bps
delivery_rate 46336000bps delivered:30275 busy:4310ms unacked:26
rcv_space:14480 rcv_ssthresh:64088 notsent:278016 minrtt:0.744
I wonder if fec pseudo-tso code is adding some kind of artifacts,
maybe with TCP small queue logic.
(TX completion might be delayed too much on fec driver side)
Can you try
ethtool -K eth0 tso off ?
Alternatively I think I mentioned earlier that you could try to reduce
gso_max_size on a 100Mbit link
ip link set dev eth0 gso_max_size 16384
(Presumably a driver could do this tuning automatically)
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 9:49 ` Eric Dumazet
@ 2023-10-16 10:35 ` Eric Dumazet
2023-10-16 18:25 ` Stefan Wahren
2023-10-16 18:21 ` Stefan Wahren
1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2023-10-16 10:35 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Oct 15, 2023 at 12:23 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> >
> > Hi,
> >
> > Am 15.10.23 um 01:26 schrieb Stefan Wahren:
> > > Hi Eric,
> > >
> > > Am 15.10.23 um 00:51 schrieb Eric Dumazet:
> > >> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
> > >> wrote:
> > ...
> > >> Hmm, we receive ~3200 acks per second, I am not sure the
> > >> tcp_tso_should_defer() logic
> > >> would hurt ?
> > >>
> > >> Also the ss binary on the client seems very old, or its output has
> > >> been mangled perhaps ?
> > > this binary is from Yocto kirkstone:
> > >
> > > # ss --version
> > > ss utility, iproute2-5.17.0
> > >
> > > This shouldn't be too old. Maybe some missing kernel settings?
> > >
> > I think I was able to fix the issue by enabling the proper kernel
> > settings. I reran the initial bad and good cases and overwrote the log
> > files:
> >
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
>
> Excellent, thanks.
>
> I see your kernel uses HZ=100, have you tried HZ=1000 by any chance ?
>
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
>
> I see that the bad run seems to be stuck for a while with cwnd=66, but
> a smaller number of packets in flight (26 in the following ss extract)
>
> ESTAB 0 315664 192.168.1.12:60542 192.168.1.129:5001
> timer:(on,030ms,0) ino:13011 sk:2 <->
> skmem:(r0,rb131072,t48488,tb295680,f3696,w319888,o0,bl0,d0) ts sack
> cubic wscale:7,6 rto:210 rtt:3.418/1.117 mss:1448 pmtu:1500 rcvmss:536
> advmss:1448 cwnd:66 ssthresh:20 bytes_sent:43874400
> bytes_acked:43836753 segs_out:30302 segs_in:14110 data_segs_out:30300
> send 223681685bps lastsnd:10 lastrcv:4310 pacing_rate 268408200bps
> delivery_rate 46336000bps delivered:30275 busy:4310ms unacked:26
> rcv_space:14480 rcv_ssthresh:64088 notsent:278016 minrtt:0.744
>
> I wonder if fec pseudo-tso code is adding some kind of artifacts,
> maybe with TCP small queue logic.
> (TX completion might be delayed too much on fec driver side)
Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
tsq: add a shortcut in tcp_small_queue_check()")
has been accidentally removed in 2017 (75c119afe14f "tcp: implement
rb-tree based retransmit queue")
Could you try this fix:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
return true;
}
+static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
+{
+ const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
+
+ /* No skb in the rtx queue. */
+ if (!node)
+ return true;
+
+ /* Only one skb in rtx queue. */
+ return !node->rb_left && !node->rb_right;
+}
+
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
* (These limits are doubled for retransmits)
@@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
*sk, const struct sk_buff *skb,
limit += extra_bytes;
}
if (refcount_read(&sk->sk_wmem_alloc) > limit) {
- /* Always send skb if rtx queue is empty.
+ /* Always send skb if rtx queue is empty or has one skb.
* No need to wait for TX completion to call us back,
* after softirq/tasklet schedule.
* This helps when TX completions are delayed too much.
*/
- if (tcp_rtx_queue_empty(sk))
+ if (tcp_rtx_queue_empty_or_single_skb(sk))
return false;
set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 10:35 ` Eric Dumazet
@ 2023-10-16 18:25 ` Stefan Wahren
2023-10-16 18:47 ` Eric Dumazet
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-16 18:25 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 16.10.23 um 12:35 schrieb Eric Dumazet:
> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
> tsq: add a shortcut in tcp_small_queue_check()")
> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
> rb-tree based retransmit queue")
>
> Could you try this fix:
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
> return true;
> }
>
> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
> +{
> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
> +
> + /* No skb in the rtx queue. */
> + if (!node)
> + return true;
> +
> + /* Only one skb in rtx queue. */
> + return !node->rb_left && !node->rb_right;
> +}
> +
> /* TCP Small Queues :
> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> * (These limits are doubled for retransmits)
> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
> *sk, const struct sk_buff *skb,
> limit += extra_bytes;
> }
> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
> - /* Always send skb if rtx queue is empty.
> + /* Always send skb if rtx queue is empty or has one skb.
> * No need to wait for TX completion to call us back,
> * after softirq/tasklet schedule.
> * This helps when TX completions are delayed too much.
> */
> - if (tcp_rtx_queue_empty(sk))
> + if (tcp_rtx_queue_empty_or_single_skb(sk))
> return false;
>
> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
CONFIG_HZ_100=y
root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 192 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
[ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
The figures are comparable to disabling TSO -> Good
Thanks
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 18:25 ` Stefan Wahren
@ 2023-10-16 18:47 ` Eric Dumazet
2023-10-17 9:53 ` Stefan Wahren
0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2023-10-16 18:47 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi Eric,
>
> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
> > On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
> > Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
> > tsq: add a shortcut in tcp_small_queue_check()")
> > has been accidentally removed in 2017 (75c119afe14f "tcp: implement
> > rb-tree based retransmit queue")
> >
> > Could you try this fix:
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
> > 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
> > return true;
> > }
> >
> > +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
> > +{
> > + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
> > +
> > + /* No skb in the rtx queue. */
> > + if (!node)
> > + return true;
> > +
> > + /* Only one skb in rtx queue. */
> > + return !node->rb_left && !node->rb_right;
> > +}
> > +
> > /* TCP Small Queues :
> > * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> > * (These limits are doubled for retransmits)
> > @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
> > *sk, const struct sk_buff *skb,
> > limit += extra_bytes;
> > }
> > if (refcount_read(&sk->sk_wmem_alloc) > limit) {
> > - /* Always send skb if rtx queue is empty.
> > + /* Always send skb if rtx queue is empty or has one skb.
> > * No need to wait for TX completion to call us back,
> > * after softirq/tasklet schedule.
> > * This helps when TX completions are delayed too much.
> > */
> > - if (tcp_rtx_queue_empty(sk))
> > + if (tcp_rtx_queue_empty_or_single_skb(sk))
> > return false;
> >
> > set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>
> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
> CONFIG_HZ_100=y
>
> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
> ------------------------------------------------------------
> Client connecting to 192.168.1.129, TCP port 5001
> TCP window size: 192 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
>
> The figures are comparable to disabling TSO -> Good
>
> Thanks
Great. I suspect a very slow TX completion from fec then.
Could you use the following bpftrace program while your iperf is running ?
bpftrace -e '
k:__dev_queue_xmit {
$skb = (struct sk_buff *)arg0;
if ($skb->fclone == 2) {
@start[$skb] = nsecs;
}
}
k:__kfree_skb {
$skb = (struct sk_buff *)arg0;
if ($skb->fclone == 2 && @start[$skb]) {
@tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
delete(@start[$skb]);
}
}
END { clear(@start); }'
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 18:47 ` Eric Dumazet
@ 2023-10-17 9:53 ` Stefan Wahren
2023-10-17 12:08 ` Eric Dumazet
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-17 9:53 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 16.10.23 um 20:47 schrieb Eric Dumazet:
> On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi Eric,
>>
>> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
>>> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
>>> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
>>> tsq: add a shortcut in tcp_small_queue_check()")
>>> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
>>> rb-tree based retransmit queue")
>>>
>>> Could you try this fix:
>>>
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
>>> 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
>>> return true;
>>> }
>>>
>>> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
>>> +{
>>> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
>>> +
>>> + /* No skb in the rtx queue. */
>>> + if (!node)
>>> + return true;
>>> +
>>> + /* Only one skb in rtx queue. */
>>> + return !node->rb_left && !node->rb_right;
>>> +}
>>> +
>>> /* TCP Small Queues :
>>> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>>> * (These limits are doubled for retransmits)
>>> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
>>> *sk, const struct sk_buff *skb,
>>> limit += extra_bytes;
>>> }
>>> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
>>> - /* Always send skb if rtx queue is empty.
>>> + /* Always send skb if rtx queue is empty or has one skb.
>>> * No need to wait for TX completion to call us back,
>>> * after softirq/tasklet schedule.
>>> * This helps when TX completions are delayed too much.
>>> */
>>> - if (tcp_rtx_queue_empty(sk))
>>> + if (tcp_rtx_queue_empty_or_single_skb(sk))
>>> return false;
>>>
>>> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
>> CONFIG_HZ_100=y
>>
>> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
>> ------------------------------------------------------------
>> Client connecting to 192.168.1.129, TCP port 5001
>> TCP window size: 192 KByte (default)
>> ------------------------------------------------------------
>> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
>> [ ID] Interval Transfer Bandwidth
>> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
>> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
>> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
>> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
>> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
>> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
>> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
>>
>> The figures are comparable to disabling TSO -> Good
>>
>> Thanks
> Great. I suspect a very slow TX completion from fec then.
>
> Could you use the following bpftrace program while your iperf is running ?
unfortunately bpftrace and most of its dependencies are not available on my
platform. I looked at some guides and it seems to have a lot of (build)
dependencies. On a PC this won't be a problem, but on my ARM platform
there is only 1 GB eMMC space left.
Before investing a lot of time to get bpftrace running, is there an
alternative solution?
>
> bpftrace -e '
> k:__dev_queue_xmit {
> $skb = (struct sk_buff *)arg0;
> if ($skb->fclone == 2) {
> @start[$skb] = nsecs;
> }
> }
> k:__kfree_skb {
> $skb = (struct sk_buff *)arg0;
> if ($skb->fclone == 2 && @start[$skb]) {
> @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
> delete(@start[$skb]);
> }
> }
> END { clear(@start); }'
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-17 9:53 ` Stefan Wahren
@ 2023-10-17 12:08 ` Eric Dumazet
2023-10-17 12:17 ` Stefan Wahren
0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2023-10-17 12:08 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Tue, Oct 17, 2023 at 11:53 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi Eric,
>
> Am 16.10.23 um 20:47 schrieb Eric Dumazet:
> > On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> >> Hi Eric,
> >>
> >> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
> >>> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
> >>> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
> >>> tsq: add a shortcut in tcp_small_queue_check()")
> >>> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
> >>> rb-tree based retransmit queue")
> >>>
> >>> Could you try this fix:
> >>>
> >>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> >>> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
> >>> 100644
> >>> --- a/net/ipv4/tcp_output.c
> >>> +++ b/net/ipv4/tcp_output.c
> >>> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
> >>> return true;
> >>> }
> >>>
> >>> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
> >>> +{
> >>> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
> >>> +
> >>> + /* No skb in the rtx queue. */
> >>> + if (!node)
> >>> + return true;
> >>> +
> >>> + /* Only one skb in rtx queue. */
> >>> + return !node->rb_left && !node->rb_right;
> >>> +}
> >>> +
> >>> /* TCP Small Queues :
> >>> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> >>> * (These limits are doubled for retransmits)
> >>> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
> >>> *sk, const struct sk_buff *skb,
> >>> limit += extra_bytes;
> >>> }
> >>> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
> >>> - /* Always send skb if rtx queue is empty.
> >>> + /* Always send skb if rtx queue is empty or has one skb.
> >>> * No need to wait for TX completion to call us back,
> >>> * after softirq/tasklet schedule.
> >>> * This helps when TX completions are delayed too much.
> >>> */
> >>> - if (tcp_rtx_queue_empty(sk))
> >>> + if (tcp_rtx_queue_empty_or_single_skb(sk))
> >>> return false;
> >>>
> >>> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
> >> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
> >> CONFIG_HZ_100=y
> >>
> >> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
> >> ------------------------------------------------------------
> >> Client connecting to 192.168.1.129, TCP port 5001
> >> TCP window size: 192 KByte (default)
> >> ------------------------------------------------------------
> >> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
> >> [ ID] Interval Transfer Bandwidth
> >> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
> >> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
> >> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
> >> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
> >> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
> >> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
> >> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
> >>
> >> The figures are comparable to disabling TSO -> Good
> >>
> >> Thanks
> > Great. I suspect a very slow TX completion from fec then.
> >
> > Could you use the following bpftrace program while your iperf is running?
> Unfortunately, bpftrace and most of its dependencies are not available on my
> platform. I looked at some guides and it seems to have a lot of (build)
> dependencies. On a PC this won't be a problem, but on my ARM platform
> there is only 1 GB of eMMC space left.
>
> Before investing a lot of time to get bpftrace running, is there an
> alternative solution?
No worries Stefan, I think we do not have to get precise numbers. I
will try to provide a debug patch (since you are able to build custom
kernels).
In the meantime, I will submit the official TCP patch, as you said it
was helping a lot.
Just to confirm, have you tried the patch on top of the latest net tree?
Thanks.
> >
> > bpftrace -e '
> > k:__dev_queue_xmit {
> > $skb = (struct sk_buff *)arg0;
> > if ($skb->fclone == 2) {
> > @start[$skb] = nsecs;
> > }
> > }
> > k:__kfree_skb {
> > $skb = (struct sk_buff *)arg0;
> > if ($skb->fclone == 2 && @start[$skb]) {
> > @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
> > delete(@start[$skb]);
> > }
> > }
> > END { clear(@start); }'
>
^ permalink raw reply [flat|nested] 20+ messages in thread

* Re: iperf performance regression since Linux 5.18
2023-10-17 12:08 ` Eric Dumazet
@ 2023-10-17 12:17 ` Stefan Wahren
0 siblings, 0 replies; 20+ messages in thread
From: Stefan Wahren @ 2023-10-17 12:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Am 17.10.23 um 14:08 schrieb Eric Dumazet:
> On Tue, Oct 17, 2023 at 11:53 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi Eric,
>>
>> Am 16.10.23 um 20:47 schrieb Eric Dumazet:
>>> On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>>>> Hi Eric,
>>>>
>>>> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
>>>>> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
>>>>> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
>>>>> tsq: add a shortcut in tcp_small_queue_check()")
>>>>> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
>>>>> rb-tree based retransmit queue")
>>>>>
>>>>> Could you try this fix:
>>>>>
>>>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>>>> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
>>>>> 100644
>>>>> --- a/net/ipv4/tcp_output.c
>>>>> +++ b/net/ipv4/tcp_output.c
>>>>> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
>>>>> return true;
>>>>> }
>>>>>
>>>>> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
>>>>> +{
>>>>> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
>>>>> +
>>>>> + /* No skb in the rtx queue. */
>>>>> + if (!node)
>>>>> + return true;
>>>>> +
>>>>> + /* Only one skb in rtx queue. */
>>>>> + return !node->rb_left && !node->rb_right;
>>>>> +}
>>>>> +
>>>>> /* TCP Small Queues :
>>>>> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>>>>> * (These limits are doubled for retransmits)
>>>>> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
>>>>> *sk, const struct sk_buff *skb,
>>>>> limit += extra_bytes;
>>>>> }
>>>>> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
>>>>> - /* Always send skb if rtx queue is empty.
>>>>> + /* Always send skb if rtx queue is empty or has one skb.
>>>>> * No need to wait for TX completion to call us back,
>>>>> * after softirq/tasklet schedule.
>>>>> * This helps when TX completions are delayed too much.
>>>>> */
>>>>> - if (tcp_rtx_queue_empty(sk))
>>>>> + if (tcp_rtx_queue_empty_or_single_skb(sk))
>>>>> return false;
>>>>>
>>>>> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>>>> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
>>>> CONFIG_HZ_100=y
>>>>
>>>> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
>>>> ------------------------------------------------------------
>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>> TCP window size: 192 KByte (default)
>>>> ------------------------------------------------------------
>>>> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
>>>> [ ID] Interval Transfer Bandwidth
>>>> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
>>>> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
>>>> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
>>>> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
>>>> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
>>>> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
>>>> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
>>>>
>>>> The figures are comparable to disabling TSO -> Good
>>>>
>>>> Thanks
>>> Great. I suspect a very slow TX completion from fec then.
>>>
>>> Could you use the following bpftrace program while your iperf is running?
>> Unfortunately, bpftrace and most of its dependencies are not available on my
>> platform. I looked at some guides and it seems to have a lot of (build)
>> dependencies. On a PC this won't be a problem, but on my ARM platform
>> there is only 1 GB of eMMC space left.
>>
>> Before investing a lot of time to get bpftrace running, is there an
>> alternative solution?
> No worries Stefan, I think we do not have to get precise numbers, I
> will try to provide a debug patch (since you are able to build custom
> kernels)
Great, this would be much easier for me.
>
> In the meantime, I will submit the official TCP patch, as you said it
> was helping a lot.
>
> Just to confirm, have you tried the patch on top of the latest net tree?
Not yet (only Linux 6.1.49 stable), but I can do this today and come
back to you.
>
> Thanks.
>
>>> bpftrace -e '
>>> k:__dev_queue_xmit {
>>> $skb = (struct sk_buff *)arg0;
>>> if ($skb->fclone == 2) {
>>> @start[$skb] = nsecs;
>>> }
>>> }
>>> k:__kfree_skb {
>>> $skb = (struct sk_buff *)arg0;
>>> if ($skb->fclone == 2 && @start[$skb]) {
>>> @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
>>> delete(@start[$skb]);
>>> }
>>> }
>>> END { clear(@start); }'
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 9:49 ` Eric Dumazet
2023-10-16 10:35 ` Eric Dumazet
@ 2023-10-16 18:21 ` Stefan Wahren
1 sibling, 0 replies; 20+ messages in thread
From: Stefan Wahren @ 2023-10-16 18:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 16.10.23 um 11:49 schrieb Eric Dumazet:
> On Sun, Oct 15, 2023 at 12:23 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi,
>>
>> Am 15.10.23 um 01:26 schrieb Stefan Wahren:
>>> Hi Eric,
>>>
>>> Am 15.10.23 um 00:51 schrieb Eric Dumazet:
>>>> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
>>>> wrote:
>> ...
>>>> Hmm, we receive ~3200 acks per second, I am not sure the
>>>> tcp_tso_should_defer() logic
>>>> would hurt?
>>>>
>>>> Also the ss binary on the client seems very old, or its output has
>>>> been mangled perhaps?
>>> this binary is from Yocto kirkstone:
>>>
>>> # ss --version
>>> ss utility, iproute2-5.17.0
>>>
>>> This shouldn't be too old. Maybe some missing kernel settings?
>>>
>> I think I was able to fix the issue by enabling the proper kernel
>> settings. I reran the initial bad and good cases and overwrote the log
>> files:
>>
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
> Excellent, thanks.
>
> I see your kernel uses HZ=100, have you tried HZ=1000 by any chance ?
>
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
I tried, but it doesn't have any influence.
> I see that the bad run seems to be stuck for a while with cwnd=66, but
> a smaller amount of packets in flight (26 in following ss extract)
>
> ESTAB 0 315664 192.168.1.12:60542 192.168.1.129:5001
> timer:(on,030ms,0) ino:13011 sk:2 <->
> skmem:(r0,rb131072,t48488,tb295680,f3696,w319888,o0,bl0,d0) ts sack
> cubic wscale:7,6 rto:210 rtt:3.418/1.117 mss:1448 pmtu:1500 rcvmss:536
> advmss:1448 cwnd:66 ssthresh:20 bytes_sent:43874400
> bytes_acked:43836753 segs_out:30302 segs_in:14110 data_segs_out:30300
> send 223681685bps lastsnd:10 lastrcv:4310 pacing_rate 268408200bps
> delivery_rate 46336000bps delivered:30275 busy:4310ms unacked:26
> rcv_space:14480 rcv_ssthresh:64088 notsent:278016 minrtt:0.744
>
> I wonder if fec pseudo-tso code is adding some kind of artifacts,
> maybe with TCP small queue logic.
> (TX completion might be delayed too much on fec driver side)
>
> Can you try
>
> ethtool -K eth0 tso off ?
TSO off, CONFIG_HZ_100=y
root@tarragon:~# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 122 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 52326 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 11.6 MBytes 97.5 Mbits/sec
[ 3] 1.0- 2.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 2.0- 3.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 3.0- 4.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 4.0- 5.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 5.0- 6.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 6.0- 7.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 8.0- 9.0 sec 11.5 MBytes 96.5 Mbits/sec
[ 3] 9.0-10.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 0.0-10.0 sec 113 MBytes 94.5 Mbits/sec
The figures seem slightly better than with tcp_tso_rtt_log = 0 -> Good
> Alternatively I think I mentioned earlier that you could try to reduce
> gso_max_size on a 100Mbit link
>
> ip link set dev eth0 gso_max_size 16384
TSO on, gso_max_size 16384, CONFIG_HZ_100=y
root@tarragon:~# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 101 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 54548 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 2.0- 3.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 4.0- 5.0 sec 10.6 MBytes 89.1 Mbits/sec
[ 3] 5.0- 6.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 6.0- 7.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 7.0- 8.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 8.0- 9.0 sec 10.8 MBytes 90.2 Mbits/sec
[ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 0.0-10.0 sec 109 MBytes 91.7 Mbits/sec
The figures are similar to tcp_tso_rtt_log = 0 -> Good
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-14 19:40 ` Neal Cardwell
2023-10-14 21:16 ` Dave Taht
2023-10-14 22:51 ` Eric Dumazet
@ 2023-10-15 0:06 ` Stefan Wahren
2 siblings, 0 replies; 20+ messages in thread
From: Stefan Wahren @ 2023-10-15 0:06 UTC (permalink / raw)
To: Neal Cardwell
Cc: Eric Dumazet, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Am 14.10.23 um 21:40 schrieb Neal Cardwell:
> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi,
>>
>> Am 09.10.23 um 21:19 schrieb Neal Cardwell:
>>> On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>>>> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>>>>> Hi,
>>>>> we recently switched on our ARM NXP i.MX6ULL based embedded device
>>>>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
>>>>> that we noticed a measurable performance regression on the Ethernet
>>>>> interface (driver: fec, 100 Mbit link) while running iperf client on the
>>>>> device:
>>>>>
>>>>> BAD
>>>>>
>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>> ------------------------------------------------------------
>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>> TCP window size: 96.2 KByte (default)
>>>>> ------------------------------------------------------------
>>>>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
>>>>> [ ID] Interval Transfer Bandwidth
>>>>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
>>>>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
>>>>>
>>>>> GOOD
>>>>>
>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>> ------------------------------------------------------------
>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>> TCP window size: 96.2 KByte (default)
>>>>> ------------------------------------------------------------
>>>>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
>>>>> [ ID] Interval Transfer Bandwidth
>>>>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
>>>>>
>>>>> We were able to bisect this down to this commit:
>>>>>
>>>>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
>>>>> TSO packet sizes based on min_rtt
>>>>>
>>>>> Disabling this new setting via:
>>>>>
>>>>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
>>>>>
>>>>> confirm that this was the cause of the performance regression.
>>>>>
>>>>> Is it expected that the new default setting has such a performance impact?
>>> Indeed, thanks for the report.
>>>
>>> In addition to the "ss" output Eric mentioned, could you please grab
>>> "nstat" output, which should allow us to calculate the average TSO/GSO
>>> and LRO/GRO burst sizes, which is the key thing tuned with the
>>> tcp_tso_rtt_log knob.
>>>
>>> So it would be great to have the following from both data sender and
>>> data receiver, for both the good case and bad case, if you could start
>>> these before your test and kill them after the test stops:
>>>
>>> (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
>>> nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
>> I uploaded everything here:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress
>>
>> The server part is an Ubuntu installation connected to the internet. At
>> first I logged the good case, then continued with the bad case. I
>> accidentally deleted a log file from the bad case, so I repeated the
>> whole bad case. The uploaded bad case files are therefore from the third run.
> Thanks for the detailed data!
>
> Here are some notes from looking at this data:
>
> + bad client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> IpOutRequests 308 44.7
> IpExtOutOctets 10050656 1403181.0
> est bytes per TSO burst: 10050656 / 308 = 32632
> est packets per TSO burst: 32632 / 1448 ~= 22.5
>
> + good client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> IpOutRequests 529 62.0
> IpExtOutOctets 11502992 1288711.5
> est bytes per TSO burst: 11502992 / 529 ~= 21745
> est packets per TSO burst: 21745 / 1448 ~= 15.0
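[The burst-size estimates above are simple ratios of the nstat deltas; a sketch of the arithmetic, with the MSS of 1448 bytes taken from the ss output and counter values from the logs quoted above:]

```python
MSS = 1448  # bytes per full-size segment, from the ss output

def est_tso_burst(ip_out_requests, ip_out_octets):
    """Average TSO burst implied by nstat counters:
    (bytes per IP send, full-size packets per IP send)."""
    bytes_per_burst = ip_out_octets / ip_out_requests
    return bytes_per_burst, bytes_per_burst / MSS

bad_bytes, bad_pkts = est_tso_burst(308, 10050656)    # bad client
good_bytes, good_pkts = est_tso_burst(529, 11502992)  # good client
print(round(bad_pkts, 1), round(good_pkts, 1))        # 22.5 15.0
```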
>
> + bad client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>
> + good client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> Fri Oct 13 15:04:36 CEST 2023
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
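[The "Mbits/sec allowed by cwnd" figures in both snippets follow the usual one-window-per-RTT ceiling; a quick sketch of that arithmetic, with cwnd, MSS, and RTT values taken from the ss extracts above:]

```python
def cwnd_allowed_mbps(cwnd, mss_bytes, rtt_sec):
    """Throughput ceiling implied by cwnd: one full window per RTT."""
    return cwnd * mss_bytes * 8 / rtt_sec / 1e6

bad = cwnd_allowed_mbps(68, 1448, 0.0018)    # bad case:  ~437.6 Mbit/s
good = cwnd_allowed_mbps(106, 1448, 0.0028)  # good case: ~438.5 Mbit/s
```

Both ceilings sit far above the 100 Mbit link rate, which supports the conclusion that cwnd is not the limiting factor.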
>
> So it seems indeed like cwnd is not the limiting factor, and instead
> there is something about the larger TSO/GSO bursts (roughly 22.5
> packets per burst on average) in the "bad" case that is causing
> problems, and preventing the sender from keeping the pipe fully
> utilized.
>
> So perhaps the details of the tcp_tso_should_defer() logic are hurting
> performance?
>
> The default value of tcp_tso_win_divisor is 3, and in the bad case the
> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> suspiciously close to the average TSO burst size of 22.5. So my guess
> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> TSO burst size in the "good" case, and fully utilize the pipe. So it
> seems worth an experiment, to see what we can learn.
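[The divisor arithmetic in the paragraph above can be checked directly; cwnd of 68 comes from the "bad" ss log and the tcp_tso_win_divisor values 3 and 5 are the ones under discussion:]

```python
def tso_defer_burst_cap(cwnd, tso_win_divisor):
    """Approximate per-burst packet cap from the win_divisor check in
    tcp_tso_should_defer(): roughly cwnd / tcp_tso_win_divisor."""
    return cwnd / tso_win_divisor

cap_default = tso_defer_burst_cap(68, 3)  # ~22.7, near the observed ~22.5
cap_raised = tso_defer_burst_cap(68, 5)   # 13.6, near the "good" ~15.0
```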
>
> To test that theory, could you please try running the following as
> root on the data sender machine, and then re-running the "bad" test
> with tcp_tso_rtt_log at the default value of 9?
>
> sysctl net.ipv4.tcp_tso_win_divisor=5
Unfortunately this doesn't fix it.
Please look at the trace and sysctl settings [1]. I will try to figure
out what's wrong with iproute2-ss later. CET says it's time to sleep.
[1] -
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/e1ceb689d7797eb10127613861d56cb3303f7b72
>
> Thanks!
> neal
^ permalink raw reply [flat|nested] 20+ messages in thread