* iperf performance regression since Linux 5.18
@ 2023-10-09 18:58 Stefan Wahren
2023-10-09 19:10 ` Eric Dumazet
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-09 18:58 UTC (permalink / raw)
To: Eric Dumazet, Jakub Kicinski
Cc: Neal Cardwell, Fabio Estevam, linux-imx, Stefan Wahren,
Michael Heimpold, netdev
Hi,
we recently switched our ARM NXP i.MX6ULL based embedded device
(Tarragon Master [1]) from an older kernel version to Linux 6.1. After
that we noticed a measurable performance regression on the Ethernet
interface (driver: fec, 100 Mbit link) while running the iperf client on
the device:
BAD
# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 96.2 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
[ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
[ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
[ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
[ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
[ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
[ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
[ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
[ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
[ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
[ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
GOOD
# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 96.2 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
[ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
[ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
We were able to bisect this down to this commit:
first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
TSO packet sizes based on min_rtt
Disabling this new setting via:
echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
confirmed that this was the cause of the performance regression.
Is it expected that the new default setting has such a performance impact?
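For reference, the knob can be checked and, if needed, disabled persistently
like this; this is just a minimal sketch and the sysctl.d file name is an
arbitrary example:
cat /proc/sys/net/ipv4/tcp_tso_rtt_log        # default is 9
sysctl -w net.ipv4.tcp_tso_rtt_log=0          # disable for this boot
echo 'net.ipv4.tcp_tso_rtt_log = 0' > /etc/sysctl.d/99-tcp-tso.conf  # persist across reboots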
More information about the platform ...
# ethtool -k eth0
Features for eth0:
rx-checksumming: on
tx-checksumming: on
tx-checksum-ipv4: on
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: on
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off [fixed]
scatter-gather: on
tx-scatter-gather: on
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: off [fixed]
tx-tcp-mangleid-segmentation: off
tx-tcp6-segmentation: off [fixed]
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: off [fixed]
ntuple-filters: off [fixed]
receive-hashing: off [fixed]
highdma: off [fixed]
rx-vlan-filter: off [fixed]
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: off [fixed]
tx-gre-csum-segmentation: off [fixed]
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]
tx-tunnel-remcsum-segmentation: off [fixed]
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: off [fixed]
tx-gso-list: off [fixed]
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: off [fixed]
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]
rx-gro-list: off
macsec-hw-offload: off [fixed]
rx-udp-gro-forwarding: off
hsr-tag-ins-offload: off [fixed]
hsr-tag-rm-offload: off [fixed]
hsr-fwd-offload: off [fixed]
hsr-dup-offload: off [fixed]
[1] -
https://elixir.bootlin.com/linux/latest/source/arch/arm/boot/dts/nxp/imx/imx6ull-tarragon-master.dts
* Re: iperf performance regression since Linux 5.18
2023-10-09 18:58 iperf performance regression since Linux 5.18 Stefan Wahren
@ 2023-10-09 19:10 ` Eric Dumazet
2023-10-09 19:19 ` Neal Cardwell
2023-10-11 12:58 ` Stefan Wahren
0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2023-10-09 19:10 UTC (permalink / raw)
To: Stefan Wahren
Cc: Jakub Kicinski, Neal Cardwell, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev
On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi,
> we recently switched on our ARM NXP i.MX6ULL based embedded device
> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> that we noticed a measurable performance regression on the Ethernet
> interface (driver: fec, 100 Mbit link) while running iperf client on the
> device:
>
> BAD
>
> # iperf -t 10 -i 1 -c 192.168.1.129
> ------------------------------------------------------------
> Client connecting to 192.168.1.129, TCP port 5001
> TCP window size: 96.2 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
>
> GOOD
>
> # iperf -t 10 -i 1 -c 192.168.1.129
> ------------------------------------------------------------
> Client connecting to 192.168.1.129, TCP port 5001
> TCP window size: 96.2 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
>
> We were able to bisect this down to this commit:
>
> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> TSO packet sizes based on min_rtt
>
> Disabling this new setting via:
>
> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
>
> confirm that this was the cause of the performance regression.
>
> Is it expected that the new default setting has such a performance impact?
Thanks for the report.
Normally no. I guess you need to give us more details:
qdisc in use, MTU in use, congestion control in use, and "ss -temoi dst
192.168.1.129" output from the sender side while the flow is running.
Note that reaching line rate on a TCP flow is always tricky,
regardless of what 'line rate' is.
I suspect an issue on the receiving side with larger GRO packets, perhaps?
You could try to limit GRO or TSO packet sizes to determine if this is
a driver issue.
(ip link set dev ethX gro_max_size XXXXX gso_max_size YYYYY)
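For example (eth0 and the sizes below are only illustrative values, not a
recommendation):
ip link set dev eth0 gso_max_size 23168 gro_max_size 23168   # ~16 segments of 1448 bytes
ip link set dev eth0 gso_max_size 65536 gro_max_size 65536   # back to the usual defaults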
>
> More information of the platform ...
>
> # ethtool -k eth0
> Features for eth0:
> rx-checksumming: on
> tx-checksumming: on
> tx-checksum-ipv4: on
> tx-checksum-ip-generic: off [fixed]
> tx-checksum-ipv6: on
> tx-checksum-fcoe-crc: off [fixed]
> tx-checksum-sctp: off [fixed]
> scatter-gather: on
> tx-scatter-gather: on
> tx-scatter-gather-fraglist: off [fixed]
> tcp-segmentation-offload: on
> tx-tcp-segmentation: on
> tx-tcp-ecn-segmentation: off [fixed]
> tx-tcp-mangleid-segmentation: off
> tx-tcp6-segmentation: off [fixed]
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off [fixed]
> rx-vlan-offload: on
> tx-vlan-offload: off [fixed]
> ntuple-filters: off [fixed]
> receive-hashing: off [fixed]
> highdma: off [fixed]
> rx-vlan-filter: off [fixed]
> vlan-challenged: off [fixed]
> tx-lockless: off [fixed]
> netns-local: off [fixed]
> tx-gso-robust: off [fixed]
> tx-fcoe-segmentation: off [fixed]
> tx-gre-segmentation: off [fixed]
> tx-gre-csum-segmentation: off [fixed]
> tx-ipxip4-segmentation: off [fixed]
> tx-ipxip6-segmentation: off [fixed]
> tx-udp_tnl-segmentation: off [fixed]
> tx-udp_tnl-csum-segmentation: off [fixed]
> tx-gso-partial: off [fixed]
> tx-tunnel-remcsum-segmentation: off [fixed]
> tx-sctp-segmentation: off [fixed]
> tx-esp-segmentation: off [fixed]
> tx-udp-segmentation: off [fixed]
> tx-gso-list: off [fixed]
> fcoe-mtu: off [fixed]
> tx-nocache-copy: off
> loopback: off [fixed]
> rx-fcs: off [fixed]
> rx-all: off [fixed]
> tx-vlan-stag-hw-insert: off [fixed]
> rx-vlan-stag-hw-parse: off [fixed]
> rx-vlan-stag-filter: off [fixed]
> l2-fwd-offload: off [fixed]
> hw-tc-offload: off [fixed]
> esp-hw-offload: off [fixed]
> esp-tx-csum-hw-offload: off [fixed]
> rx-udp_tunnel-port-offload: off [fixed]
> tls-hw-tx-offload: off [fixed]
> tls-hw-rx-offload: off [fixed]
> rx-gro-hw: off [fixed]
> tls-hw-record: off [fixed]
> rx-gro-list: off
> macsec-hw-offload: off [fixed]
> rx-udp-gro-forwarding: off
> hsr-tag-ins-offload: off [fixed]
> hsr-tag-rm-offload: off [fixed]
> hsr-fwd-offload: off [fixed]
> hsr-dup-offload: off [fixed]
>
> [1] -
> https://elixir.bootlin.com/linux/latest/source/arch/arm/boot/dts/nxp/imx/imx6ull-tarragon-master.dts
* Re: iperf performance regression since Linux 5.18
2023-10-09 19:10 ` Eric Dumazet
@ 2023-10-09 19:19 ` Neal Cardwell
2023-10-13 13:37 ` Stefan Wahren
2023-10-11 12:58 ` Stefan Wahren
1 sibling, 1 reply; 20+ messages in thread
From: Neal Cardwell @ 2023-10-09 19:19 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stefan Wahren, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev
On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> >
> > Hi,
> > we recently switched on our ARM NXP i.MX6ULL based embedded device
> > (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> > that we noticed a measurable performance regression on the Ethernet
> > interface (driver: fec, 100 Mbit link) while running iperf client on the
> > device:
> >
> > BAD
> >
> > # iperf -t 10 -i 1 -c 192.168.1.129
> > ------------------------------------------------------------
> > Client connecting to 192.168.1.129, TCP port 5001
> > TCP window size: 96.2 KByte (default)
> > ------------------------------------------------------------
> > [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> > [ ID] Interval Transfer Bandwidth
> > [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> > [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> > [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> > [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> > [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> > [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> > [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> > [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> > [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> > [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> > [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> >
> > GOOD
> >
> > # iperf -t 10 -i 1 -c 192.168.1.129
> > ------------------------------------------------------------
> > Client connecting to 192.168.1.129, TCP port 5001
> > TCP window size: 96.2 KByte (default)
> > ------------------------------------------------------------
> > [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> > [ ID] Interval Transfer Bandwidth
> > [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> > [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> > [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> > [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> > [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> > [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> > [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> > [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> > [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> > [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> > [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> >
> > We were able to bisect this down to this commit:
> >
> > first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> > TSO packet sizes based on min_rtt
> >
> > Disabling this new setting via:
> >
> > echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> >
> > confirm that this was the cause of the performance regression.
> >
> > Is it expected that the new default setting has such a performance impact?
Indeed, thanks for the report.
In addition to the "ss" output Eric mentioned, could you please grab
"nstat" output, which should allow us to calculate the average TSO/GSO
and LRO/GRO burst sizes, which is the key thing tuned with the
tcp_tso_rtt_log knob.
So it would be great to have the following from both data sender and
data receiver, for both the good case and bad case, if you could start
these before your test and kill them after the test stops:
(while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
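If it is easier, the whole thing can be wrapped so the loggers stop by
themselves after the run (purely a sketch; file names as above, iperf
arguments taken from your report):
(while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
SS_PID=$!
nstat -n
(while true; do date; nstat; sleep 1; done) > /root/nstat.txt &
NSTAT_PID=$!
iperf -t 10 -i 1 -c 192.168.1.129
kill $SS_PID $NSTAT_PID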
Thanks!
neal
* Re: iperf performance regression since Linux 5.18
2023-10-09 19:10 ` Eric Dumazet
2023-10-09 19:19 ` Neal Cardwell
@ 2023-10-11 12:58 ` Stefan Wahren
1 sibling, 0 replies; 20+ messages in thread
From: Stefan Wahren @ 2023-10-11 12:58 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jakub Kicinski, Neal Cardwell, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev
Hi Eric,
Am 09.10.23 um 21:10 schrieb Eric Dumazet:
> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi,
>> we recently switched on our ARM NXP i.MX6ULL based embedded device
>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
>> that we noticed a measurable performance regression on the Ethernet
>> interface (driver: fec, 100 Mbit link) while running iperf client on the
>> device:
>>
>> BAD
>>
>> # iperf -t 10 -i 1 -c 192.168.1.129
>> ------------------------------------------------------------
>> Client connecting to 192.168.1.129, TCP port 5001
>> TCP window size: 96.2 KByte (default)
>> ------------------------------------------------------------
>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
>> [ ID] Interval Transfer Bandwidth
>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
>>
>> GOOD
>>
>> # iperf -t 10 -i 1 -c 192.168.1.129
>> ------------------------------------------------------------
>> Client connecting to 192.168.1.129, TCP port 5001
>> TCP window size: 96.2 KByte (default)
>> ------------------------------------------------------------
>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
>> [ ID] Interval Transfer Bandwidth
>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
>>
>> We were able to bisect this down to this commit:
>>
>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
>> TSO packet sizes based on min_rtt
>>
>> Disabling this new setting via:
>>
>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
>>
>> confirm that this was the cause of the performance regression.
>>
>> Is it expected that the new default setting has such a performance impact?
> Thanks for the report
>
> Normally no. I guess you need to give us more details.
>
> qdisc in use, MTU in use, congestion control in use, "ss -temoi dst
> 192.168.1.129 " output from sender side while the flow is running.
Since ss and nstat are not yet available on my embedded device, this will
take some time.
But here is at least the information that is already available:
qdisc: pfifo_fast
MTU: 1500
congestion control: cubic
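For reference, this kind of information can be read on the device with
something along the lines of:
tc qdisc show dev eth0                       # qdisc
ip link show dev eth0                        # MTU
sysctl net.ipv4.tcp_congestion_control       # congestion control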
Best regards
>
> Note that reaching line rate on a TCP flow is always tricky,
> regardless of what 'line rate' is.
>
> I suspect an issue on the receiving side with larger GRO packets perhaps ?
>
> You could try to limit GRO or TSO packet sizes to determine if this is
> a driver issue.
>
> (ip link set dev ethX gro_max_size XXXXX gso_max_size YYYYY)
>
* Re: iperf performance regression since Linux 5.18
2023-10-09 19:19 ` Neal Cardwell
@ 2023-10-13 13:37 ` Stefan Wahren
2023-10-14 19:40 ` Neal Cardwell
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-13 13:37 UTC (permalink / raw)
To: Neal Cardwell, Eric Dumazet
Cc: Jakub Kicinski, Fabio Estevam, linux-imx, Stefan Wahren,
Michael Heimpold, netdev
Hi,
Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>>> Hi,
>>> we recently switched on our ARM NXP i.MX6ULL based embedded device
>>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
>>> that we noticed a measurable performance regression on the Ethernet
>>> interface (driver: fec, 100 Mbit link) while running iperf client on the
>>> device:
>>>
>>> BAD
>>>
>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>> ------------------------------------------------------------
>>> Client connecting to 192.168.1.129, TCP port 5001
>>> TCP window size: 96.2 KByte (default)
>>> ------------------------------------------------------------
>>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
>>> [ ID] Interval Transfer Bandwidth
>>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
>>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
>>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
>>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
>>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
>>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
>>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
>>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
>>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
>>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
>>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
>>>
>>> GOOD
>>>
>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>> ------------------------------------------------------------
>>> Client connecting to 192.168.1.129, TCP port 5001
>>> TCP window size: 96.2 KByte (default)
>>> ------------------------------------------------------------
>>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
>>> [ ID] Interval Transfer Bandwidth
>>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
>>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
>>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
>>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
>>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
>>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
>>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
>>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
>>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
>>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
>>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
>>>
>>> We were able to bisect this down to this commit:
>>>
>>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
>>> TSO packet sizes based on min_rtt
>>>
>>> Disabling this new setting via:
>>>
>>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
>>>
>>> confirm that this was the cause of the performance regression.
>>>
>>> Is it expected that the new default setting has such a performance impact?
> Indeed, thanks for the report.
>
> In addition to the "ss" output Eric mentioned, could you please grab
> "nstat" output, which should allow us to calculate the average TSO/GSO
> and LRO/GRO burst sizes, which is the key thing tuned with the
> tcp_tso_rtt_log knob.
>
> So it would be great to have the following from both data sender and
> data receiver, for both the good case and bad case, if you could start
> these before your test and kill them after the test stops:
>
> (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
I uploaded everything here:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress
The server side is an Ubuntu installation connected to the internet. At
first I logged the good case, then I continued with the bad case. I
accidentally deleted a log file of the bad case, so I repeated the whole
bad case again. So the uploaded bad-case files are from the third run.
>
> Thanks!
> neal
* Re: iperf performance regression since Linux 5.18
2023-10-13 13:37 ` Stefan Wahren
@ 2023-10-14 19:40 ` Neal Cardwell
2023-10-14 21:16 ` Dave Taht
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Neal Cardwell @ 2023-10-14 19:40 UTC (permalink / raw)
To: Stefan Wahren
Cc: Eric Dumazet, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi,
>
> Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> >>> Hi,
> >>> we recently switched on our ARM NXP i.MX6ULL based embedded device
> >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> >>> that we noticed a measurable performance regression on the Ethernet
> >>> interface (driver: fec, 100 Mbit link) while running iperf client on the
> >>> device:
> >>>
> >>> BAD
> >>>
> >>> # iperf -t 10 -i 1 -c 192.168.1.129
> >>> ------------------------------------------------------------
> >>> Client connecting to 192.168.1.129, TCP port 5001
> >>> TCP window size: 96.2 KByte (default)
> >>> ------------------------------------------------------------
> >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> >>> [ ID] Interval Transfer Bandwidth
> >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> >>>
> >>> GOOD
> >>>
> >>> # iperf -t 10 -i 1 -c 192.168.1.129
> >>> ------------------------------------------------------------
> >>> Client connecting to 192.168.1.129, TCP port 5001
> >>> TCP window size: 96.2 KByte (default)
> >>> ------------------------------------------------------------
> >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> >>> [ ID] Interval Transfer Bandwidth
> >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> >>>
> >>> We were able to bisect this down to this commit:
> >>>
> >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> >>> TSO packet sizes based on min_rtt
> >>>
> >>> Disabling this new setting via:
> >>>
> >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> >>>
> >>> confirm that this was the cause of the performance regression.
> >>>
> >>> Is it expected that the new default setting has such a performance impact?
> > Indeed, thanks for the report.
> >
> > In addition to the "ss" output Eric mentioned, could you please grab
> > "nstat" output, which should allow us to calculate the average TSO/GSO
> > and LRO/GRO burst sizes, which is the key thing tuned with the
> > tcp_tso_rtt_log knob.
> >
> > So it would be great to have the following from both data sender and
> > data receiver, for both the good case and bad case, if you could start
> > these before your test and kill them after the test stops:
> >
> > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> i upload everything here:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress
>
> The server part is a Ubuntu installation connected to the internet. At
> first i logged the good case, then i continued with the bad case.
> Accidentally i delete a log file of bad case, so i repeated the whole
> bad case again. So the uploaded bad case files are from the third run.
Thanks for the detailed data!
Here are some notes from looking at this data:
+ bad client: avg TSO burst size is roughly:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
IpOutRequests 308 44.7
IpExtOutOctets 10050656 1403181.0
est bytes per TSO burst: 10050656 / 308 = 32632
est packets per TSO burst: 32632 / 1448 ~= 22.5
+ good client: avg TSO burst size is roughly:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
IpOutRequests 529 62.0
IpExtOutOctets 11502992 1288711.5
est bytes per TSO burst: 11502992 / 529 ~= 21745
est packets per TSO burst: 21745 / 1448 ~= 15.0
+ bad client ss data:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
reordering:0
Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
+ good client ss data:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
Fri Oct 13 15:04:36 CEST 2023
State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
reordering:0
Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
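For reference, the arithmetic above as runnable one-liners (the numbers are
taken from the logs; the 1448-byte MSS follows from the 1500-byte MTU):
awk 'BEGIN { printf "bad case, bytes/burst ~= %.0f\n", 10050656 / 308 }'
awk 'BEGIN { printf "bad case, pkts/burst  ~= %.1f\n", 32632 / 1448 }'
awk 'BEGIN { printf "bad case, cwnd limit  ~= %.1f Mbit/s\n", 68 * 1448 * 8 / .0018 / 1e6 }'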
So it seems indeed like cwnd is not the limiting factor, and instead
there is something about the larger TSO/GSO bursts (roughly 22.5
packets per burst on average) in the "bad" case that is causing
problems, and preventing the sender from keeping the pipe fully
utilized.
So perhaps the details of the tcp_tso_should_defer() logic are hurting
performance?
The default value of tcp_tso_win_divisor is 3, and in the bad case the
cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
suspiciously close to the average TSO burst size of 22.5. So my guess
is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
TSO burst size in the "good" case, and fully utilize the pipe. So it
seems worth an experiment, to see what we can learn.
To test that theory, could you please try running the following as
root on the data sender machine, and then re-running the "bad" test
with tcp_tso_rtt_log at the default value of 9?
sysctl net.ipv4.tcp_tso_win_divisor=5
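i.e. something like this on the sender (the last line just restores the
default of 3 afterwards):
sysctl net.ipv4.tcp_tso_win_divisor               # should currently show 3
sysctl -w net.ipv4.tcp_tso_win_divisor=5
iperf -t 10 -i 1 -c 192.168.1.129                 # re-run the "bad" case
sysctl -w net.ipv4.tcp_tso_win_divisor=3          # restore the default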
Thanks!
neal
* Re: iperf performance regression since Linux 5.18
2023-10-14 19:40 ` Neal Cardwell
@ 2023-10-14 21:16 ` Dave Taht
2023-10-14 22:51 ` Eric Dumazet
2023-10-15 0:06 ` Stefan Wahren
2 siblings, 0 replies; 20+ messages in thread
From: Dave Taht @ 2023-10-14 21:16 UTC (permalink / raw)
To: Neal Cardwell
Cc: Stefan Wahren, Eric Dumazet, Jakub Kicinski, Fabio Estevam,
linux-imx, Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
I am also kind of a huge believer in packet captures...
On Sat, Oct 14, 2023 at 12:41 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
> >
> > Hi,
> >
> > Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> > >>> Hi,
> > >>> we recently switched on our ARM NXP i.MX6ULL based embedded device
> > >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> > >>> that we noticed a measurable performance regression on the Ethernet
> > >>> interface (driver: fec, 100 Mbit link) while running iperf client on the
> > >>> device:
> > >>>
> > >>> BAD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> > >>>
> > >>> GOOD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> > >>>
> > >>> We were able to bisect this down to this commit:
> > >>>
> > >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> > >>> TSO packet sizes based on min_rtt
> > >>>
> > >>> Disabling this new setting via:
> > >>>
> > >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> > >>>
> > >>> confirm that this was the cause of the performance regression.
> > >>>
> > >>> Is it expected that the new default setting has such a performance impact?
> > > Indeed, thanks for the report.
> > >
> > > In addition to the "ss" output Eric mentioned, could you please grab
> > > "nstat" output, which should allow us to calculate the average TSO/GSO
> > > and LRO/GRO burst sizes, which is the key thing tuned with the
> > > tcp_tso_rtt_log knob.
> > >
> > > So it would be great to have the following from both data sender and
> > > data receiver, for both the good case and bad case, if you could start
> > > these before your test and kill them after the test stops:
> > >
> > > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> > i upload everything here:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress
> >
> > The server part is a Ubuntu installation connected to the internet. At
> > first i logged the good case, then i continued with the bad case.
> > Accidentally i delete a log file of bad case, so i repeated the whole
> > bad case again. So the uploaded bad case files are from the third run.
>
> Thanks for the detailed data!
>
> Here are some notes from looking at this data:
>
> + bad client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> IpOutRequests 308 44.7
> IpExtOutOctets 10050656 1403181.0
> est bytes per TSO burst: 10050656 / 308 = 32632
> est packets per TSO burst: 32632 / 1448 ~= 22.5
>
> + good client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> IpOutRequests 529 62.0
> IpExtOutOctets 11502992 1288711.5
> est bytes per TSO burst: 11502992 / 529 ~= 21745
> est packets per TSO burst: 21745 / 1448 ~= 15.0
>
> + bad client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>
> + good client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> Fri Oct 13 15:04:36 CEST 2023
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
>
> So it seems indeed like cwnd is not the limiting factor, and instead
> there is something about the larger TSO/GSO bursts (roughly 22.5
> packets per burst on average) in the "bad" case that is causing
> problems, and preventing the sender from keeping the pipe fully
> utilized.
>
> So perhaps the details of the tcp_tso_should_defer() logic are hurting
> performance?
>
> The default value of tcp_tso_win_divisor is 3, and in the bad case the
> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> suspiciously close to the average TSO burst size of 22.5. So my guess
> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> TSO burst size in the "good" case, and fully utilize the pipe. So it
> seems worth an experiment, to see what we can learn.
>
> To test that theory, could you please try running the following as
> root on the data sender machine, and then re-running the "bad" test
> with tcp_tso_rtt_log at the default value of 9?
>
> sysctl net.ipv4.tcp_tso_win_divisor=5
>
> Thanks!
> neal
>
--
Oct 30: https://netdevconf.info/0x17/news/the-maestro-and-the-music-bof.html
Dave Täht CSO, LibreQos
* Re: iperf performance regression since Linux 5.18
2023-10-14 19:40 ` Neal Cardwell
2023-10-14 21:16 ` Dave Taht
@ 2023-10-14 22:51 ` Eric Dumazet
2023-10-14 23:24 ` Neal Cardwell
2023-10-14 23:26 ` Stefan Wahren
2023-10-15 0:06 ` Stefan Wahren
2 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2023-10-14 22:51 UTC (permalink / raw)
To: Neal Cardwell
Cc: Stefan Wahren, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
> >
> > Hi,
> >
> > Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> > >>> Hi,
> > >>> we recently switched on our ARM NXP i.MX6ULL based embedded device
> > >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> > >>> that we noticed a measurable performance regression on the Ethernet
> > >>> interface (driver: fec, 100 Mbit link) while running iperf client on the
> > >>> device:
> > >>>
> > >>> BAD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> > >>>
> > >>> GOOD
> > >>>
> > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > >>> ------------------------------------------------------------
> > >>> Client connecting to 192.168.1.129, TCP port 5001
> > >>> TCP window size: 96.2 KByte (default)
> > >>> ------------------------------------------------------------
> > >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> > >>> [ ID] Interval Transfer Bandwidth
> > >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> > >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> > >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> > >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> > >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> > >>>
> > >>> We were able to bisect this down to this commit:
> > >>>
> > >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> > >>> TSO packet sizes based on min_rtt
> > >>>
> > >>> Disabling this new setting via:
> > >>>
> > >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> > >>>
> > >>> confirm that this was the cause of the performance regression.
> > >>>
> > >>> Is it expected that the new default setting has such a performance impact?
> > > Indeed, thanks for the report.
> > >
> > > In addition to the "ss" output Eric mentioned, could you please grab
> > > "nstat" output, which should allow us to calculate the average TSO/GSO
> > > and LRO/GRO burst sizes, which is the key thing tuned with the
> > > tcp_tso_rtt_log knob.
> > >
> > > So it would be great to have the following from both data sender and
> > > data receiver, for both the good case and bad case, if you could start
> > > these before your test and kill them after the test stops:
> > >
> > > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> > i upload everything here:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress
> >
> > The server part is a Ubuntu installation connected to the internet. At
> > first i logged the good case, then i continued with the bad case.
> > Accidentally i delete a log file of bad case, so i repeated the whole
> > bad case again. So the uploaded bad case files are from the third run.
>
> Thanks for the detailed data!
>
> Here are some notes from looking at this data:
>
> + bad client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> IpOutRequests 308 44.7
> IpExtOutOctets 10050656 1403181.0
> est bytes per TSO burst: 10050656 / 308 = 32632
> est packets per TSO burst: 32632 / 1448 ~= 22.5
>
> + good client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> IpOutRequests 529 62.0
> IpExtOutOctets 11502992 1288711.5
> est bytes per TSO burst: 11502992 / 529 ~= 21745
> est packets per TSO burst: 21745 / 1448 ~= 15.0
>
> + bad client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>
> + good client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> Fri Oct 13 15:04:36 CEST 2023
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
>
> So it seems indeed like cwnd is not the limiting factor, and instead
> there is something about the larger TSO/GSO bursts (roughly 22.5
> packets per burst on average) in the "bad" case that is causing
> problems, and preventing the sender from keeping the pipe fully
> utilized.
>
> So perhaps the details of the tcp_tso_should_defer() logic are hurting
> performance?
>
> The default value of tcp_tso_win_divisor is 3, and in the bad case the
> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> suspiciously close to the average TSO burst size of 22.5. So my guess
> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> TSO burst size in the "good" case, and fully utilize the pipe. So it
> seems worth an experiment, to see what we can learn.
>
> To test that theory, could you please try running the following as
> root on the data sender machine, and then re-running the "bad" test
> with tcp_tso_rtt_log at the default value of 9?
>
> sysctl net.ipv4.tcp_tso_win_divisor=5
>
> Thanks!
> neal
Hmm, we receive ~3200 ACKs per second, so I am not sure the
tcp_tso_should_defer() logic would hurt?
Also, the ss binary on the client seems very old, or perhaps its output
has been mangled?
State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
ESTAB 0 492320 192.168.1.12:33284 192.168.1.129:5001
timer:(on,030ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
reordering:0
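If that ss build supports it, a quick check on the client would be
something like:
ss --version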
* Re: iperf performance regression since Linux 5.18
2023-10-14 22:51 ` Eric Dumazet
@ 2023-10-14 23:24 ` Neal Cardwell
2023-10-14 23:26 ` Stefan Wahren
1 sibling, 0 replies; 20+ messages in thread
From: Neal Cardwell @ 2023-10-14 23:24 UTC (permalink / raw)
To: Eric Dumazet
Cc: Stefan Wahren, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Sat, Oct 14, 2023 at 6:51 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com> wrote:
> >
> > On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
> > >
> > > Hi,
> > >
> > > Am 09.10.23 um 21:19 schrieb Neal Cardwell:
> > > > On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
> > > >> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> > > >>> Hi,
> > > >>> we recently switched on our ARM NXP i.MX6ULL based embedded device
> > > >>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
> > > >>> that we noticed a measurable performance regression on the Ethernet
> > > >>> interface (driver: fec, 100 Mbit link) while running iperf client on the
> > > >>> device:
> > > >>>
> > > >>> BAD
> > > >>>
> > > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > > >>> ------------------------------------------------------------
> > > >>> Client connecting to 192.168.1.129, TCP port 5001
> > > >>> TCP window size: 96.2 KByte (default)
> > > >>> ------------------------------------------------------------
> > > >>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
> > > >>> [ ID] Interval Transfer Bandwidth
> > > >>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
> > > >>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
> > > >>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
> > > >>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
> > > >>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
> > > >>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
> > > >>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
> > > >>>
> > > >>> GOOD
> > > >>>
> > > >>> # iperf -t 10 -i 1 -c 192.168.1.129
> > > >>> ------------------------------------------------------------
> > > >>> Client connecting to 192.168.1.129, TCP port 5001
> > > >>> TCP window size: 96.2 KByte (default)
> > > >>> ------------------------------------------------------------
> > > >>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
> > > >>> [ ID] Interval Transfer Bandwidth
> > > >>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
> > > >>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
> > > >>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
> > > >>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
> > > >>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
> > > >>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
> > > >>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
> > > >>>
> > > >>> We were able to bisect this down to this commit:
> > > >>>
> > > >>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
> > > >>> TSO packet sizes based on min_rtt
> > > >>>
> > > >>> Disabling this new setting via:
> > > >>>
> > > >>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
> > > >>>
> > > >>> confirm that this was the cause of the performance regression.
> > > >>>
> > > >>> Is it expected that the new default setting has such a performance impact?
> > > > Indeed, thanks for the report.
> > > >
> > > > In addition to the "ss" output Eric mentioned, could you please grab
> > > > "nstat" output, which should allow us to calculate the average TSO/GSO
> > > > and LRO/GRO burst sizes, which is the key thing tuned with the
> > > > tcp_tso_rtt_log knob.
> > > >
> > > > So it would be great to have the following from both data sender and
> > > > data receiver, for both the good case and bad case, if you could start
> > > > these before your test and kill them after the test stops:
> > > >
> > > > (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
> > > > nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
> > > i upload everything here:
> > > https://github.com/lategoodbye/tcp_tso_rtt_log_regress
> > >
> > > The server part is a Ubuntu installation connected to the internet. At
> > > first i logged the good case, then i continued with the bad case.
> > > Accidentally i delete a log file of bad case, so i repeated the whole
> > > bad case again. So the uploaded bad case files are from the third run.
> >
> > Thanks for the detailed data!
> >
> > Here are some notes from looking at this data:
> >
> > + bad client: avg TSO burst size is roughly:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> > IpOutRequests 308 44.7
> > IpExtOutOctets 10050656 1403181.0
> > est bytes per TSO burst: 10050656 / 308 = 32632
> > est packets per TSO burst: 32632 / 1448 ~= 22.5
> >
> > + good client: avg TSO burst size is roughly:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> > IpOutRequests 529 62.0
> > IpExtOutOctets 11502992 1288711.5
> > est bytes per TSO burst: 11502992 / 529 ~= 21745
> > est packets per TSO burst: 21745 / 1448 ~= 15.0
> >
> > + bad client ss data:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> > State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> > ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> > timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> > reordering:0
> > Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
> >
> > + good client ss data:
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> > Fri Oct 13 15:04:36 CEST 2023
> > State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> > ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> > timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> > reordering:0
> > Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
> >
> > So it seems indeed like cwnd is not the limiting factor, and instead
> > there is something about the larger TSO/GSO bursts (roughly 22.5
> > packets per burst on average) in the "bad" case that is causing
> > problems, and preventing the sender from keeping the pipe fully
> > utilized.
> >
> > So perhaps the details of the tcp_tso_should_defer() logic are hurting
> > performance?
> >
> > The default value of tcp_tso_win_divisor is 3, and in the bad case the
> > cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> > suspiciously close to the average TSO burst size of 22.5. So my guess
> > is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> > perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> > TSO burst size in the "good" case, and fully utilize the pipe. So it
> > seems worth an experiment, to see what we can learn.
> >
> > To test that theory, could you please try running the following as
> > root on the data sender machine, and then re-running the "bad" test
> > with tcp_tso_rtt_log at the default value of 9?
> >
> > sysctl net.ipv4.tcp_tso_win_divisor=5
> >
> > Thanks!
> > neal
>
> Hmm, we receive ~3200 acks per second, I am not sure the
> tcp_tso_should_defer() logic
> would hurt ?
>
> Also the ss binary on the client seems very old, or its output has
> been mangled perhaps ?
>
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 492320 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,030ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
Yes, agreed, it would be super-useful if the next run could have both
a packet capture and newer/fuller ss output; perhaps something like:
(a) to run tcpdump on the data sender:
tcpdump -w /root/dump.pcap -n -s 116 -c 1000000 host $REMOTE_HOST -i $INTERFACE &
(b) to build a newer ss:
git clone git://git.kernel.org/pub/scm/network/iproute2/iproute2.git
cd iproute2/
./configure
make
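The freshly built binary can then be run in place without installing it,
with something like:
./misc/ss -tenmoi dst 192.168.1.129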
Thanks!
neal
* Re: iperf performance regression since Linux 5.18
2023-10-14 22:51 ` Eric Dumazet
2023-10-14 23:24 ` Neal Cardwell
@ 2023-10-14 23:26 ` Stefan Wahren
2023-10-15 10:23 ` Stefan Wahren
1 sibling, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-14 23:26 UTC (permalink / raw)
To: Eric Dumazet, Neal Cardwell
Cc: Jakub Kicinski, Fabio Estevam, linux-imx, Stefan Wahren,
Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 15.10.23 um 00:51 schrieb Eric Dumazet:
> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com> wrote:
>> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>>> Hi,
>>>
>>> Am 09.10.23 um 21:19 schrieb Neal Cardwell:
>>>> On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>>>>> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>>>>>> Hi,
>>>>>> we recently switched on our ARM NXP i.MX6ULL based embedded device
>>>>>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
>>>>>> that we noticed a measurable performance regression on the Ethernet
>>>>>> interface (driver: fec, 100 Mbit link) while running iperf client on the
>>>>>> device:
>>>>>>
>>>>>> BAD
>>>>>>
>>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>>> ------------------------------------------------------------
>>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>>> TCP window size: 96.2 KByte (default)
>>>>>> ------------------------------------------------------------
>>>>>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
>>>>>> [ ID] Interval Transfer Bandwidth
>>>>>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
>>>>>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
>>>>>>
>>>>>> GOOD
>>>>>>
>>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>>> ------------------------------------------------------------
>>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>>> TCP window size: 96.2 KByte (default)
>>>>>> ------------------------------------------------------------
>>>>>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
>>>>>> [ ID] Interval Transfer Bandwidth
>>>>>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>>>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
>>>>>>
>>>>>> We were able to bisect this down to this commit:
>>>>>>
>>>>>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
>>>>>> TSO packet sizes based on min_rtt
>>>>>>
>>>>>> Disabling this new setting via:
>>>>>>
>>>>>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
>>>>>>
>>>>>> confirm that this was the cause of the performance regression.
>>>>>>
>>>>>> Is it expected that the new default setting has such a performance impact?
>>>> Indeed, thanks for the report.
>>>>
>>>> In addition to the "ss" output Eric mentioned, could you please grab
>>>> "nstat" output, which should allow us to calculate the average TSO/GSO
>>>> and LRO/GRO burst sizes, which is the key thing tuned with the
>>>> tcp_tso_rtt_log knob.
>>>>
>>>> So it would be great to have the following from both data sender and
>>>> data receiver, for both the good case and bad case, if you could start
>>>> these before your test and kill them after the test stops:
>>>>
>>>> (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
>>>> nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
>>> i upload everything here:
>>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress
>>>
>>> The server part is a Ubuntu installation connected to the internet. At
>>> first i logged the good case, then i continued with the bad case.
>>> Accidentally i delete a log file of bad case, so i repeated the whole
>>> bad case again. So the uploaded bad case files are from the third run.
>> Thanks for the detailed data!
>>
>> Here are some notes from looking at this data:
>>
>> + bad client: avg TSO burst size is roughly:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
>> IpOutRequests 308 44.7
>> IpExtOutOctets 10050656 1403181.0
>> est bytes per TSO burst: 10050656 / 308 = 32632
>> est packets per TSO burst: 32632 / 1448 ~= 22.5
>>
>> + good client: avg TSO burst size is roughly:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
>> IpOutRequests 529 62.0
>> IpExtOutOctets 11502992 1288711.5
>> est bytes per TSO burst: 11502992 / 529 ~= 21745
>> est packets per TSO burst: 21745 / 1448 ~= 15.0
>>
>> + bad client ss data:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
>> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
>> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
>> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
>> reordering:0
>> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>>
>> + good client ss data:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
>> Fri Oct 13 15:04:36 CEST 2023
>> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
>> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
>> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
>> reordering:0
>> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
>>
>> So it seems indeed like cwnd is not the limiting factor, and instead
>> there is something about the larger TSO/GSO bursts (roughly 22.5
>> packets per burst on average) in the "bad" case that is causing
>> problems, and preventing the sender from keeping the pipe fully
>> utilized.
>>
>> So perhaps the details of the tcp_tso_should_defer() logic are hurting
>> performance?
>>
>> The default value of tcp_tso_win_divisor is 3, and in the bad case the
>> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
>> suspiciously close to the average TSO burst size of 22.5. So my guess
>> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
>> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
>> TSO burst size in the "good" case, and fully utilize the pipe. So it
>> seems worth an experiment, to see what we can learn.
>>
>> To test that theory, could you please try running the following as
>> root on the data sender machine, and then re-running the "bad" test
>> with tcp_tso_rtt_log at the default value of 9?
>>
>> sysctl net.ipv4.tcp_tso_win_divisor=5
>>
>> Thanks!
>> neal
> Hmm, we receive ~3200 acks per second, I am not sure the
> tcp_tso_should_defer() logic
> would hurt ?
>
> Also the ss binary on the client seems very old, or its output has
> been mangled perhaps ?
this binary is from Yocto kirkstone:
# ss --version
ss utility, iproute2-5.17.0
This shouldn't be too old. Maybe some missing kernel settings?
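Note: ss reads its detailed per-socket TCP counters over the kernel's
inet_diag netlink interface and falls back to parsing /proc/net/tcp when
that interface is missing, which would explain the truncated output.
Assuming that is the missing piece, the options to enable would likely be:
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y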
Maybe relevant extract from the config:
#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
# CONFIG_PROC_CHILDREN is not set
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
# CONFIG_TMPFS_POSIX_ACL is not set
# CONFIG_TMPFS_XATTR is not set
CONFIG_MEMFD_CREATE=y
CONFIG_CONFIGFS_FS=y
# end of Pseudo filesystems
CONFIG_NET=y
CONFIG_NET_EGRESS=y
#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
CONFIG_UNIX_SCM=y
CONFIG_AF_UNIX_OOB=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
# CONFIG_XFRM_USER is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_NET_IPVTI is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
CONFIG_INET_TABLE_PERTURB_ORDER=16
# CONFIG_INET_DIAG is not set
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
CONFIG_IPV6=y
# CONFIG_IPV6_ROUTER_PREF is not set
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_IPV6_ILA is not set
# CONFIG_IPV6_VTI is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
# CONFIG_IPV6_RPL_LWTUNNEL is not set
# CONFIG_IPV6_IOAM6_LWTUNNEL is not set
# CONFIG_MPTCP is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
# CONFIG_BRIDGE_NETFILTER is not set
#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_INGRESS is not set
CONFIG_NETFILTER_EGRESS=y
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NETFILTER_NETLINK_OSF is not set
CONFIG_NF_CONNTRACK=y
# CONFIG_NF_LOG_SYSLOG is not set
# CONFIG_NF_CONNTRACK_MARK is not set
# CONFIG_NF_CONNTRACK_ZONES is not set
# CONFIG_NF_CONNTRACK_PROCFS is not set
# CONFIG_NF_CONNTRACK_EVENTS is not set
# CONFIG_NF_CONNTRACK_TIMEOUT is not set
# CONFIG_NF_CONNTRACK_TIMESTAMP is not set
# CONFIG_NF_CONNTRACK_LABELS is not set
CONFIG_NF_CT_PROTO_DCCP=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
# CONFIG_NF_CONNTRACK_AMANDA is not set
CONFIG_NF_CONNTRACK_FTP=y
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_SNMP is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
CONFIG_NF_NAT=y
CONFIG_NF_NAT_FTP=y
CONFIG_NF_NAT_MASQUERADE=y
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y
#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set
# CONFIG_NETFILTER_XT_CONNMARK is not set
#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_CONNMARK is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
CONFIG_NETFILTER_XT_NAT=y
# CONFIG_NETFILTER_XT_TARGET_NETMAP is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_REDIRECT is not set
CONFIG_NETFILTER_XT_TARGET_MASQUERADE=y
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y
#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_CLUSTER is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNBYTES is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLABEL is not set
# CONFIG_NETFILTER_XT_MATCH_CONNLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_CONNMARK is not set
# CONFIG_NETFILTER_XT_MATCH_CONNTRACK is not set
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HELPER is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OSF is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_SOCKET is not set
# CONFIG_NETFILTER_XT_MATCH_STATE is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# end of Core Netfilter Configuration
# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set
#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=y
# CONFIG_NF_SOCKET_IPV4 is not set
# CONFIG_NF_TPROXY_IPV4 is not set
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
# CONFIG_NF_REJECT_IPV4 is not set
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_TTL is not set
# CONFIG_IP_NF_FILTER is not set
# CONFIG_IP_NF_TARGET_SYNPROXY is not set
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
# CONFIG_IP_NF_TARGET_NETMAP is not set
# CONFIG_IP_NF_TARGET_REDIRECT is not set
# CONFIG_IP_NF_MANGLE is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_ARPTABLES is not set
# end of IP: Netfilter Configuration
#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_SOCKET_IPV6 is not set
# CONFIG_NF_TPROXY_IPV6 is not set
# CONFIG_NF_DUP_IPV6 is not set
# CONFIG_NF_REJECT_IPV6 is not set
# CONFIG_NF_LOG_IPV6 is not set
# CONFIG_IP6_NF_IPTABLES is not set
# end of IPv6: Netfilter Configuration
CONFIG_NF_DEFRAG_IPV6=y
# CONFIG_NF_CONNTRACK_BRIDGE is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_BPFILTER is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_BRIDGE_VLAN_FILTERING=y
# CONFIG_BRIDGE_MRP is not set
# CONFIG_BRIDGE_CFM is not set
# CONFIG_NET_DSA is not set
CONFIG_VLAN_8021Q=y
# CONFIG_VLAN_8021Q_GVRP is not set
# CONFIG_VLAN_8021Q_MVRP is not set
CONFIG_LLC=y
# CONFIG_LLC2 is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
# CONFIG_NET_SCHED is not set
# CONFIG_DCB is not set
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_QRTR is not set
# CONFIG_NET_NCSI is not set
# CONFIG_CGROUP_NET_PRIO is not set
# CONFIG_CGROUP_NET_CLASSID is not set
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# end of Network testing
# end of Networking options
# CONFIG_HAMRADIO is not set
CONFIG_CAN=y
CONFIG_CAN_RAW=y
CONFIG_CAN_BCM=y
# CONFIG_CAN_GW is not set
# CONFIG_CAN_J1939 is not set
# CONFIG_CAN_ISOTP is not set
# CONFIG_BT is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
# CONFIG_MCTP is not set
CONFIG_WIRELESS=y
CONFIG_CFG80211=y
# CONFIG_NL80211_TESTMODE is not set
# CONFIG_CFG80211_DEVELOPER_WARNINGS is not set
# CONFIG_CFG80211_CERTIFICATION_ONUS is not set
CONFIG_CFG80211_REQUIRE_SIGNED_REGDB=y
CONFIG_CFG80211_USE_KERNEL_REGDB_KEYS=y
CONFIG_CFG80211_DEFAULT_PS=y
# CONFIG_CFG80211_DEBUGFS is not set
CONFIG_CFG80211_CRDA_SUPPORT=y
# CONFIG_CFG80211_WEXT is not set
CONFIG_MAC80211=y
CONFIG_MAC80211_HAS_RC=y
CONFIG_MAC80211_RC_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT_MINSTREL=y
CONFIG_MAC80211_RC_DEFAULT="minstrel_ht"
# CONFIG_MAC80211_MESH is not set
CONFIG_MAC80211_LEDS=y
# CONFIG_MAC80211_DEBUGFS is not set
# CONFIG_MAC80211_MESSAGE_TRACING is not set
# CONFIG_MAC80211_DEBUG_MENU is not set
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
CONFIG_NET_SELFTESTS=y
CONFIG_PAGE_POOL=y
# CONFIG_PAGE_POOL_STATS is not set
# CONFIG_FAILOVER is not set
CONFIG_ETHTOOL_NETLINK=y
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-14 19:40 ` Neal Cardwell
2023-10-14 21:16 ` Dave Taht
2023-10-14 22:51 ` Eric Dumazet
@ 2023-10-15 0:06 ` Stefan Wahren
2 siblings, 0 replies; 20+ messages in thread
From: Stefan Wahren @ 2023-10-15 0:06 UTC (permalink / raw)
To: Neal Cardwell
Cc: Eric Dumazet, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Am 14.10.23 um 21:40 schrieb Neal Cardwell:
> On Fri, Oct 13, 2023 at 9:37 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi,
>>
>> Am 09.10.23 um 21:19 schrieb Neal Cardwell:
>>> On Mon, Oct 9, 2023 at 3:11 PM Eric Dumazet <edumazet@google.com> wrote:
>>>> On Mon, Oct 9, 2023 at 8:58 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>>>>> Hi,
>>>>> we recently switched on our ARM NXP i.MX6ULL based embedded device
>>>>> (Tarragon Master [1]) from an older kernel version to Linux 6.1. After
>>>>> that we noticed a measurable performance regression on the Ethernet
>>>>> interface (driver: fec, 100 Mbit link) while running iperf client on the
>>>>> device:
>>>>>
>>>>> BAD
>>>>>
>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>> ------------------------------------------------------------
>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>> TCP window size: 96.2 KByte (default)
>>>>> ------------------------------------------------------------
>>>>> [ 3] local 192.168.1.12 port 56022 connected with 192.168.1.129 port 5001
>>>>> [ ID] Interval Transfer Bandwidth
>>>>> [ 3] 0.0- 1.0 sec 9.88 MBytes 82.8 Mbits/sec
>>>>> [ 3] 1.0- 2.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 2.0- 3.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>> [ 3] 3.0- 4.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 4.0- 5.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 5.0- 6.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 6.0- 7.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>> [ 3] 7.0- 8.0 sec 9.75 MBytes 81.8 Mbits/sec
>>>>> [ 3] 8.0- 9.0 sec 9.62 MBytes 80.7 Mbits/sec
>>>>> [ 3] 9.0-10.0 sec 9.50 MBytes 79.7 Mbits/sec
>>>>> [ 3] 0.0-10.0 sec 96.5 MBytes 80.9 Mbits/sec
>>>>>
>>>>> GOOD
>>>>>
>>>>> # iperf -t 10 -i 1 -c 192.168.1.129
>>>>> ------------------------------------------------------------
>>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>>> TCP window size: 96.2 KByte (default)
>>>>> ------------------------------------------------------------
>>>>> [ 3] local 192.168.1.12 port 54898 connected with 192.168.1.129 port 5001
>>>>> [ ID] Interval Transfer Bandwidth
>>>>> [ 3] 0.0- 1.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>>> [ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>> [ 3] 2.0- 3.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>> [ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
>>>>> [ 3] 4.0- 5.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 5.0- 6.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 6.0- 7.0 sec 10.8 MBytes 90.2 Mbits/sec
>>>>> [ 3] 7.0- 8.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 8.0- 9.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
>>>>> [ 3] 0.0-10.0 sec 109 MBytes 91.4 Mbits/sec
>>>>>
>>>>> We were able to bisect this down to this commit:
>>>>>
>>>>> first bad commit: [65466904b015f6eeb9225b51aeb29b01a1d4b59c] tcp: adjust
>>>>> TSO packet sizes based on min_rtt
>>>>>
>>>>> Disabling this new setting via:
>>>>>
>>>>> echo 0 > /proc/sys/net/ipv4/tcp_tso_rtt_log
>>>>>
>>>>> confirm that this was the cause of the performance regression.
>>>>>
>>>>> Is it expected that the new default setting has such a performance impact?
>>> Indeed, thanks for the report.
>>>
>>> In addition to the "ss" output Eric mentioned, could you please grab
>>> "nstat" output, which should allow us to calculate the average TSO/GSO
>>> and LRO/GRO burst sizes, which is the key thing tuned with the
>>> tcp_tso_rtt_log knob.
>>>
>>> So it would be great to have the following from both data sender and
>>> data receiver, for both the good case and bad case, if you could start
>>> these before your test and kill them after the test stops:
>>>
>>> (while true; do date; ss -tenmoi; sleep 1; done) > /root/ss.txt &
>>> nstat -n; (while true; do date; nstat; sleep 1; done) > /root/nstat.txt
>> i upload everything here:
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress
>>
>> The server part is a Ubuntu installation connected to the internet. At
>> first i logged the good case, then i continued with the bad case.
>> Accidentally i delete a log file of bad case, so i repeated the whole
>> bad case again. So the uploaded bad case files are from the third run.
> Thanks for the detailed data!
>
> Here are some notes from looking at this data:
>
> + bad client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_bad.log
> IpOutRequests 308 44.7
> IpExtOutOctets 10050656 1403181.0
> est bytes per TSO burst: 10050656 / 308 = 32632
> est packets per TSO burst: 32632 / 1448 ~= 22.5
>
> + good client: avg TSO burst size is roughly:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/nstat_client_good.log
> IpOutRequests 529 62.0
> IpExtOutOctets 11502992 1288711.5
> est bytes per TSO burst: 11502992 / 529 ~= 21745
> est packets per TSO burst: 21745 / 1448 ~= 15.0
>
> + bad client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_bad.log
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 236024 192.168.1.12:39228 192.168.1.129:5001
> timer:(on,030ms,0) ino:25876 sk:414f52af rto:0.21 cwnd:68 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 68 * 1448 * 8 / .0018 / 1000000.0 ~= 437.6
>
> + good client ss data:
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/blob/main/ss_client_good.log
> Fri Oct 13 15:04:36 CEST 2023
> State Recv-Q Send-Q Local Address:Port Peer Address:PortProcess
> ESTAB 0 425712 192.168.1.12:33284 192.168.1.129:5001
> timer:(on,020ms,0) ino:20654 sk:414f52af rto:0.21 cwnd:106 ssthresh:20
> reordering:0
> Mbits/sec allowed by cwnd: 106 * 1448 * 8 / .0028 / 1000000.0 = 438.5
>
> So it seems indeed like cwnd is not the limiting factor, and instead
> there is something about the larger TSO/GSO bursts (roughly 22.5
> packets per burst on average) in the "bad" case that is causing
> problems, and preventing the sender from keeping the pipe fully
> utilized.
>
> So perhaps the details of the tcp_tso_should_defer() logic are hurting
> performance?
>
> The default value of tcp_tso_win_divisor is 3, and in the bad case the
> cwnd / tcp_tso_win_divisor = 68 / 3 = 22.7 packets, which is
> suspiciously close to the average TSO burst size of 22.5. So my guess
> is that the tcp_tso_win_divisor of 3 is the dominant factor here, and
> perhaps if we raise it to 5, then 68/5 ~= 13.60 will approximate the
> TSO burst size in the "good" case, and fully utilize the pipe. So it
> seems worth an experiment, to see what we can learn.
>
> To test that theory, could you please try running the following as
> root on the data sender machine, and then re-running the "bad" test
> with tcp_tso_rtt_log at the default value of 9?
>
> sysctl net.ipv4.tcp_tso_win_divisor=5
Unfortunately this doesn't fix it.
Please look at the trace and sysctl settings [1]. I will try to figure
out what's wrong with iproute2-ss later. CET says it's time to sleep.
[1] -
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/e1ceb689d7797eb10127613861d56cb3303f7b72
>
> Thanks!
> neal
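For reference, the per-burst figures in the quoted analysis follow directly
from the nstat interval deltas; a minimal sketch of the same arithmetic,
reusing the bad-case numbers above purely as an example:
# IpOutRequests delta 308, IpExtOutOctets delta 10050656, MSS 1448
awk 'BEGIN { r=308; o=10050656; printf "bytes/burst=%.0f packets/burst=%.1f\n", o/r, o/r/1448 }'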
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-14 23:26 ` Stefan Wahren
@ 2023-10-15 10:23 ` Stefan Wahren
2023-10-16 9:49 ` Eric Dumazet
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-15 10:23 UTC (permalink / raw)
To: Eric Dumazet, Neal Cardwell
Cc: Jakub Kicinski, Fabio Estevam, linux-imx, Stefan Wahren,
Michael Heimpold, netdev, Yuchung Cheng
Hi,
Am 15.10.23 um 01:26 schrieb Stefan Wahren:
> Hi Eric,
>
> Am 15.10.23 um 00:51 schrieb Eric Dumazet:
>> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
>> wrote:
...
>> Hmm, we receive ~3200 acks per second, I am not sure the
>> tcp_tso_should_defer() logic
>> would hurt ?
>>
>> Also the ss binary on the client seems very old, or its output has
>> been mangled perhaps ?
> this binary is from Yocto kirkstone:
>
> # ss --version
> ss utility, iproute2-5.17.0
>
> This shouldn't be too old. Maybe some missing kernel settings?
>
I think I was able to fix the issue by enabling the proper kernel
settings. I reran the initial bad and good cases and overwrote the log
files:
https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-15 10:23 ` Stefan Wahren
@ 2023-10-16 9:49 ` Eric Dumazet
2023-10-16 10:35 ` Eric Dumazet
2023-10-16 18:21 ` Stefan Wahren
0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2023-10-16 9:49 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Sun, Oct 15, 2023 at 12:23 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi,
>
> Am 15.10.23 um 01:26 schrieb Stefan Wahren:
> > Hi Eric,
> >
> > Am 15.10.23 um 00:51 schrieb Eric Dumazet:
> >> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
> >> wrote:
> ...
> >> Hmm, we receive ~3200 acks per second, I am not sure the
> >> tcp_tso_should_defer() logic
> >> would hurt ?
> >>
> >> Also the ss binary on the client seems very old, or its output has
> >> been mangled perhaps ?
> > this binary is from Yocto kirkstone:
> >
> > # ss --version
> > ss utility, iproute2-5.17.0
> >
> > This shouldn't be too old. Maybe some missing kernel settings?
> >
> i think i was able to fix the issue by enable the proper kernel
> settings. I rerun initial bad and good case again and overwrote the log
> files:
>
> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
Excellent, thanks.
I see your kernel uses HZ=100, have you tried HZ=1000 by any chance ?
CONFIG_HZ_1000=y
CONFIG_HZ=1000
I see that the bad run seems to be stuck for a while with cwnd=66, but
a smaller number of packets in flight (26 in the following ss extract)
ESTAB 0 315664 192.168.1.12:60542 192.168.1.129:5001
timer:(on,030ms,0) ino:13011 sk:2 <->
skmem:(r0,rb131072,t48488,tb295680,f3696,w319888,o0,bl0,d0) ts sack
cubic wscale:7,6 rto:210 rtt:3.418/1.117 mss:1448 pmtu:1500 rcvmss:536
advmss:1448 cwnd:66 ssthresh:20 bytes_sent:43874400
bytes_acked:43836753 segs_out:30302 segs_in:14110 data_segs_out:30300
send 223681685bps lastsnd:10 lastrcv:4310 pacing_rate 268408200bps
delivery_rate 46336000bps delivered:30275 busy:4310ms unacked:26
rcv_space:14480 rcv_ssthresh:64088 notsent:278016 minrtt:0.744
I wonder if the fec pseudo-TSO code is adding some kind of artifact,
maybe in combination with the TCP Small Queues logic.
(TX completion might be delayed too much on the fec driver side)
Can you try
ethtool -K eth0 tso off ?
Alternatively I think I mentioned earlier that you could try to reduce
gso_max_size on a 100Mbit link
ip link set dev eth0 gso_max_size 16384
(Presumably a driver could do this tuning automatically)
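The setting is not persistent across reboots; the current value can be
checked before and after the experiment and simply written back to revert
(a quick sketch):
ip -d link show dev eth0 | grep -o 'gso_max_size [0-9]*'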
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 9:49 ` Eric Dumazet
@ 2023-10-16 10:35 ` Eric Dumazet
2023-10-16 18:25 ` Stefan Wahren
2023-10-16 18:21 ` Stefan Wahren
1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2023-10-16 10:35 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Oct 15, 2023 at 12:23 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> >
> > Hi,
> >
> > Am 15.10.23 um 01:26 schrieb Stefan Wahren:
> > > Hi Eric,
> > >
> > > Am 15.10.23 um 00:51 schrieb Eric Dumazet:
> > >> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
> > >> wrote:
> > ...
> > >> Hmm, we receive ~3200 acks per second, I am not sure the
> > >> tcp_tso_should_defer() logic
> > >> would hurt ?
> > >>
> > >> Also the ss binary on the client seems very old, or its output has
> > >> been mangled perhaps ?
> > > this binary is from Yocto kirkstone:
> > >
> > > # ss --version
> > > ss utility, iproute2-5.17.0
> > >
> > > This shouldn't be too old. Maybe some missing kernel settings?
> > >
> > i think i was able to fix the issue by enable the proper kernel
> > settings. I rerun initial bad and good case again and overwrote the log
> > files:
> >
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
>
> Excellent, thanks.
>
> I see your kernel uses HZ=100, have you tried HZ=1000 by any chance ?
>
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
>
> I see that the bad run seems to be stuck for a while with cwnd=66, but
> a smaller amount of packets in flight (26 in following ss extract)
>
> ESTAB 0 315664 192.168.1.12:60542 192.168.1.129:5001
> timer:(on,030ms,0) ino:13011 sk:2 <->
> skmem:(r0,rb131072,t48488,tb295680,f3696,w319888,o0,bl0,d0) ts sack
> cubic wscale:7,6 rto:210 rtt:3.418/1.117 mss:1448 pmtu:1500 rcvmss:536
> advmss:1448 cwnd:66 ssthresh:20 bytes_sent:43874400
> bytes_acked:43836753 segs_out:30302 segs_in:14110 data_segs_out:30300
> send 223681685bps lastsnd:10 lastrcv:4310 pacing_rate 268408200bps
> delivery_rate 46336000bps delivered:30275 busy:4310ms unacked:26
> rcv_space:14480 rcv_ssthresh:64088 notsent:278016 minrtt:0.744
>
> I wonder if fec pseudo-tso code is adding some kind of artifacts,
> maybe with TCP small queue logic.
> (TX completion might be delayed too much on fec driver side)
Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
tsq: add a shortcut in tcp_small_queue_check()")
has been accidentally removed in 2017 (75c119afe14f "tcp: implement
rb-tree based retransmit queue")
Could you try this fix:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
return true;
}
+static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
+{
+ const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
+
+ /* No skb in the rtx queue. */
+ if (!node)
+ return true;
+
+ /* Only one skb in rtx queue. */
+ return !node->rb_left && !node->rb_right;
+}
+
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
* (These limits are doubled for retransmits)
@@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
limit += extra_bytes;
}
if (refcount_read(&sk->sk_wmem_alloc) > limit) {
- /* Always send skb if rtx queue is empty.
+ /* Always send skb if rtx queue is empty or has one skb.
* No need to wait for TX completion to call us back,
* after softirq/tasklet schedule.
* This helps when TX completions are delayed too much.
*/
- if (tcp_rtx_queue_empty(sk))
+ if (tcp_rtx_queue_empty_or_single_skb(sk))
return false;
set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
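To try it, the hunk can be saved to a file and applied to the tree being
built; note that mail clients often mangle leading whitespace, so the
context lines may need fixing up by hand. A generic sketch (file and
directory names are only placeholders):
cd linux-6.1/
patch -p1 < tsq-empty-or-single-skb.diff
make -j$(nproc)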
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 9:49 ` Eric Dumazet
2023-10-16 10:35 ` Eric Dumazet
@ 2023-10-16 18:21 ` Stefan Wahren
1 sibling, 0 replies; 20+ messages in thread
From: Stefan Wahren @ 2023-10-16 18:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 16.10.23 um 11:49 schrieb Eric Dumazet:
> On Sun, Oct 15, 2023 at 12:23 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi,
>>
>> Am 15.10.23 um 01:26 schrieb Stefan Wahren:
>>> Hi Eric,
>>>
>>> Am 15.10.23 um 00:51 schrieb Eric Dumazet:
>>>> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@google.com>
>>>> wrote:
>> ...
>>>> Hmm, we receive ~3200 acks per second, I am not sure the
>>>> tcp_tso_should_defer() logic
>>>> would hurt ?
>>>>
>>>> Also the ss binary on the client seems very old, or its output has
>>>> been mangled perhaps ?
>>> this binary is from Yocto kirkstone:
>>>
>>> # ss --version
>>> ss utility, iproute2-5.17.0
>>>
>>> This shouldn't be too old. Maybe some missing kernel settings?
>>>
>> i think i was able to fix the issue by enable the proper kernel
>> settings. I rerun initial bad and good case again and overwrote the log
>> files:
>>
>> https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
> Excellent, thanks.
>
> I see your kernel uses HZ=100, have you tried HZ=1000 by any chance ?
>
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
I tried, but it doesn't have any influence.
> I see that the bad run seems to be stuck for a while with cwnd=66, but
> a smaller amount of packets in flight (26 in following ss extract)
>
> ESTAB 0 315664 192.168.1.12:60542 192.168.1.129:5001
> timer:(on,030ms,0) ino:13011 sk:2 <->
> skmem:(r0,rb131072,t48488,tb295680,f3696,w319888,o0,bl0,d0) ts sack
> cubic wscale:7,6 rto:210 rtt:3.418/1.117 mss:1448 pmtu:1500 rcvmss:536
> advmss:1448 cwnd:66 ssthresh:20 bytes_sent:43874400
> bytes_acked:43836753 segs_out:30302 segs_in:14110 data_segs_out:30300
> send 223681685bps lastsnd:10 lastrcv:4310 pacing_rate 268408200bps
> delivery_rate 46336000bps delivered:30275 busy:4310ms unacked:26
> rcv_space:14480 rcv_ssthresh:64088 notsent:278016 minrtt:0.744
>
> I wonder if fec pseudo-tso code is adding some kind of artifacts,
> maybe with TCP small queue logic.
> (TX completion might be delayed too much on fec driver side)
>
> Can you try
>
> ethtool -K eth0 tso off ?
TSO off, CONFIG_HZ_100=y
root@tarragon:~# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 122 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 52326 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 11.6 MBytes 97.5 Mbits/sec
[ 3] 1.0- 2.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 2.0- 3.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 3.0- 4.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 4.0- 5.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 5.0- 6.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 6.0- 7.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 8.0- 9.0 sec 11.5 MBytes 96.5 Mbits/sec
[ 3] 9.0-10.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 0.0-10.0 sec 113 MBytes 94.5 Mbits/sec
The figures seem slightly better than with tcp_tso_rtt_log = 0 -> Good
> Alternatively I think I mentioned earlier that you could try to reduce
> gso_max_size on a 100Mbit link
>
> ip link set dev eth0 gso_max_size 16384
TSO on, gso_max_size 16384, CONFIG_HZ_100=y
root@tarragon:~# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 101 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 54548 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 1.0- 2.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 2.0- 3.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 3.0- 4.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 4.0- 5.0 sec 10.6 MBytes 89.1 Mbits/sec
[ 3] 5.0- 6.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 6.0- 7.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 7.0- 8.0 sec 11.0 MBytes 92.3 Mbits/sec
[ 3] 8.0- 9.0 sec 10.8 MBytes 90.2 Mbits/sec
[ 3] 9.0-10.0 sec 10.9 MBytes 91.2 Mbits/sec
[ 3] 0.0-10.0 sec 109 MBytes 91.7 Mbits/sec
The figures are similar to tcp_tso_rtt_log = 0 -> Good
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 10:35 ` Eric Dumazet
@ 2023-10-16 18:25 ` Stefan Wahren
2023-10-16 18:47 ` Eric Dumazet
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-16 18:25 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 16.10.23 um 12:35 schrieb Eric Dumazet:
> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
> tsq: add a shortcut in tcp_small_queue_check()")
> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
> rb-tree based retransmit queue")
>
> Could you try this fix:
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
> return true;
> }
>
> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
> +{
> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
> +
> + /* No skb in the rtx queue. */
> + if (!node)
> + return true;
> +
> + /* Only one skb in rtx queue. */
> + return !node->rb_left && !node->rb_right;
> +}
> +
> /* TCP Small Queues :
> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> * (These limits are doubled for retransmits)
> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
> *sk, const struct sk_buff *skb,
> limit += extra_bytes;
> }
> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
> - /* Always send skb if rtx queue is empty.
> + /* Always send skb if rtx queue is empty or has one skb.
> * No need to wait for TX completion to call us back,
> * after softirq/tasklet schedule.
> * This helps when TX completions are delayed too much.
> */
> - if (tcp_rtx_queue_empty(sk))
> + if (tcp_rtx_queue_empty_or_single_skb(sk))
> return false;
>
> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
CONFIG_HZ_100=y
root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
------------------------------------------------------------
Client connecting to 192.168.1.129, TCP port 5001
TCP window size: 192 KByte (default)
------------------------------------------------------------
[ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
[ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
[ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
[ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
[ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
The figures are comparable to disabling TSO -> Good
Thanks
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 18:25 ` Stefan Wahren
@ 2023-10-16 18:47 ` Eric Dumazet
2023-10-17 9:53 ` Stefan Wahren
0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2023-10-16 18:47 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi Eric,
>
> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
> > On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
> > Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
> > tsq: add a shortcut in tcp_small_queue_check()")
> > has been accidentally removed in 2017 (75c119afe14f "tcp: implement
> > rb-tree based retransmit queue")
> >
> > Could you try this fix:
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
> > 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
> > return true;
> > }
> >
> > +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
> > +{
> > + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
> > +
> > + /* No skb in the rtx queue. */
> > + if (!node)
> > + return true;
> > +
> > + /* Only one skb in rtx queue. */
> > + return !node->rb_left && !node->rb_right;
> > +}
> > +
> > /* TCP Small Queues :
> > * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> > * (These limits are doubled for retransmits)
> > @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
> > *sk, const struct sk_buff *skb,
> > limit += extra_bytes;
> > }
> > if (refcount_read(&sk->sk_wmem_alloc) > limit) {
> > - /* Always send skb if rtx queue is empty.
> > + /* Always send skb if rtx queue is empty or has one skb.
> > * No need to wait for TX completion to call us back,
> > * after softirq/tasklet schedule.
> > * This helps when TX completions are delayed too much.
> > */
> > - if (tcp_rtx_queue_empty(sk))
> > + if (tcp_rtx_queue_empty_or_single_skb(sk))
> > return false;
> >
> > set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>
> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
> CONFIG_HZ_100=y
>
> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
> ------------------------------------------------------------
> Client connecting to 192.168.1.129, TCP port 5001
> TCP window size: 192 KByte (default)
> ------------------------------------------------------------
> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
> [ ID] Interval Transfer Bandwidth
> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
>
> The figures are comparable to disabling TSO -> Good
>
> Thanks
Great. I suspect a very slow TX completion from fec then.
Could you use the following bpftrace program while your iperf is running ?
bpftrace -e '
k:__dev_queue_xmit {
$skb = (struct sk_buff *)arg0;
if ($skb->fclone == 2) {
@start[$skb] = nsecs;
}
}
k:__kfree_skb {
$skb = (struct sk_buff *)arg0;
if ($skb->fclone == 2 && @start[$skb]) {
@tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
delete(@start[$skb]);
}
}
END { clear(@start); }'
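When bpftrace is available, the usual workflow is to start it before iperf
and interrupt it once the run is done; on exit it prints the
@tx_compl_usecs histogram. A minimal sketch, assuming the program above has
been saved as tx_compl.bt (file name chosen only for illustration):
bpftrace tx_compl.bt &
iperf -t 10 -i 1 -c 192.168.1.129
kill -INT %1   # SIGINT makes bpftrace dump its maps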
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-16 18:47 ` Eric Dumazet
@ 2023-10-17 9:53 ` Stefan Wahren
2023-10-17 12:08 ` Eric Dumazet
0 siblings, 1 reply; 20+ messages in thread
From: Stefan Wahren @ 2023-10-17 9:53 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Hi Eric,
Am 16.10.23 um 20:47 schrieb Eric Dumazet:
> On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi Eric,
>>
>> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
>>> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
>>> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
>>> tsq: add a shortcut in tcp_small_queue_check()")
>>> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
>>> rb-tree based retransmit queue")
>>>
>>> Could you try this fix:
>>>
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
>>> 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
>>> return true;
>>> }
>>>
>>> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
>>> +{
>>> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
>>> +
>>> + /* No skb in the rtx queue. */
>>> + if (!node)
>>> + return true;
>>> +
>>> + /* Only one skb in rtx queue. */
>>> + return !node->rb_left && !node->rb_right;
>>> +}
>>> +
>>> /* TCP Small Queues :
>>> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>>> * (These limits are doubled for retransmits)
>>> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
>>> *sk, const struct sk_buff *skb,
>>> limit += extra_bytes;
>>> }
>>> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
>>> - /* Always send skb if rtx queue is empty.
>>> + /* Always send skb if rtx queue is empty or has one skb.
>>> * No need to wait for TX completion to call us back,
>>> * after softirq/tasklet schedule.
>>> * This helps when TX completions are delayed too much.
>>> */
>>> - if (tcp_rtx_queue_empty(sk))
>>> + if (tcp_rtx_queue_empty_or_single_skb(sk))
>>> return false;
>>>
>>> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
>> CONFIG_HZ_100=y
>>
>> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
>> ------------------------------------------------------------
>> Client connecting to 192.168.1.129, TCP port 5001
>> TCP window size: 192 KByte (default)
>> ------------------------------------------------------------
>> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
>> [ ID] Interval Transfer Bandwidth
>> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
>> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
>> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
>> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
>> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
>> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
>> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
>> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
>>
>> The figures are comparable to disabling TSO -> Good
>>
>> Thanks
> Great. I suspect a very slow TX completion from fec then.
>
> Could you use the following bpftrace program while your iperf is running ?
Unfortunately, bpftrace and most of its dependencies are not available on my
platform. I looked at some guides and it seems to have a lot of (build)
dependencies. On a PC this wouldn't be a problem, but on my ARM platform
there is only 1 GB of eMMC space left.
Before investing a lot of time to get bpftrace running, is there an
alternative solution?
>
> .bpftrace -e '
> k:__dev_queue_xmit {
> $skb = (struct sk_buff *)arg0;
> if ($skb->fclone == 2) {
> @start[$skb] = nsecs;
> }
> }
> k:__kfree_skb {
> $skb = (struct sk_buff *)arg0;
> if ($skb->fclone == 2 && @start[$skb]) {
> @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
> delete(@start[$skb]);
> }
> }
> END { clear(@start); }'
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-17 9:53 ` Stefan Wahren
@ 2023-10-17 12:08 ` Eric Dumazet
2023-10-17 12:17 ` Stefan Wahren
0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2023-10-17 12:08 UTC (permalink / raw)
To: Stefan Wahren
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
On Tue, Oct 17, 2023 at 11:53 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>
> Hi Eric,
>
> Am 16.10.23 um 20:47 schrieb Eric Dumazet:
> > On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
> >> Hi Eric,
> >>
> >> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
> >>> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
> >>> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
> >>> tsq: add a shortcut in tcp_small_queue_check()")
> >>> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
> >>> rb-tree based retransmit queue")
> >>>
> >>> Could you try this fix:
> >>>
> >>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> >>> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
> >>> 100644
> >>> --- a/net/ipv4/tcp_output.c
> >>> +++ b/net/ipv4/tcp_output.c
> >>> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
> >>> return true;
> >>> }
> >>>
> >>> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
> >>> +{
> >>> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
> >>> +
> >>> + /* No skb in the rtx queue. */
> >>> + if (!node)
> >>> + return true;
> >>> +
> >>> + /* Only one skb in rtx queue. */
> >>> + return !node->rb_left && !node->rb_right;
> >>> +}
> >>> +
> >>> /* TCP Small Queues :
> >>> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
> >>> * (These limits are doubled for retransmits)
> >>> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
> >>> *sk, const struct sk_buff *skb,
> >>> limit += extra_bytes;
> >>> }
> >>> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
> >>> - /* Always send skb if rtx queue is empty.
> >>> + /* Always send skb if rtx queue is empty or has one skb.
> >>> * No need to wait for TX completion to call us back,
> >>> * after softirq/tasklet schedule.
> >>> * This helps when TX completions are delayed too much.
> >>> */
> >>> - if (tcp_rtx_queue_empty(sk))
> >>> + if (tcp_rtx_queue_empty_or_single_skb(sk))
> >>> return false;
> >>>
> >>> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
> >> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
> >> CONFIG_HZ_100=y
> >>
> >> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
> >> ------------------------------------------------------------
> >> Client connecting to 192.168.1.129, TCP port 5001
> >> TCP window size: 192 KByte (default)
> >> ------------------------------------------------------------
> >> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
> >> [ ID] Interval Transfer Bandwidth
> >> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
> >> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
> >> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
> >> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
> >> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
> >> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
> >> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
> >> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
> >>
> >> The figures are comparable to disabling TSO -> Good
> >>
> >> Thanks
> > Great. I suspect a very slow TX completion from fec then.
> >
> > Could you use the following bpftrace program while your iperf is running ?
> unfortuntely there is no bpftrace and most of its dependencies on my
> platform. I looked at some guides and it seems to have a lot of (build)
> dependencies. On a PC this won't be a problem, but on my ARM platform
> there is only 1 GB eMMC space left.
>
> Before investing a lot of time to get bpftrace running, is there an
> alternative solution?
No worries Stefan, I think we do not need precise numbers; I will try to
provide a debug patch (since you are able to build custom kernels).
In the meantime, I will submit the official TCP patch, as you said it
was helping a lot.
Just to confirm, have you tried the patch on top of the latest net tree ?
Thanks.
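For reference, a minimal way to pick up the current net tree for such a
test (the URL below is the commonly referenced netdev net tree and is
assumed here rather than taken from the thread, as is the shallow clone):
git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
cd net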
> >
> > .bpftrace -e '
> > k:__dev_queue_xmit {
> > $skb = (struct sk_buff *)arg0;
> > if ($skb->fclone == 2) {
> > @start[$skb] = nsecs;
> > }
> > }
> > k:__kfree_skb {
> > $skb = (struct sk_buff *)arg0;
> > if ($skb->fclone == 2 && @start[$skb]) {
> > @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
> > delete(@start[$skb]);
> > }
> > }
> > END { clear(@start); }'
>
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: iperf performance regression since Linux 5.18
2023-10-17 12:08 ` Eric Dumazet
@ 2023-10-17 12:17 ` Stefan Wahren
0 siblings, 0 replies; 20+ messages in thread
From: Stefan Wahren @ 2023-10-17 12:17 UTC (permalink / raw)
To: Eric Dumazet
Cc: Neal Cardwell, Jakub Kicinski, Fabio Estevam, linux-imx,
Stefan Wahren, Michael Heimpold, netdev, Yuchung Cheng
Am 17.10.23 um 14:08 schrieb Eric Dumazet:
> On Tue, Oct 17, 2023 at 11:53 AM Stefan Wahren <wahrenst@gmx.net> wrote:
>> Hi Eric,
>>
>> Am 16.10.23 um 20:47 schrieb Eric Dumazet:
>>> On Mon, Oct 16, 2023 at 8:25 PM Stefan Wahren <wahrenst@gmx.net> wrote:
>>>> Hi Eric,
>>>>
>>>> Am 16.10.23 um 12:35 schrieb Eric Dumazet:
>>>>> On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@google.com> wrote:
>>>>> Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
>>>>> tsq: add a shortcut in tcp_small_queue_check()")
>>>>> has been accidentally removed in 2017 (75c119afe14f "tcp: implement
>>>>> rb-tree based retransmit queue")
>>>>>
>>>>> Could you try this fix:
>>>>>
>>>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>>>> index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b
>>>>> 100644
>>>>> --- a/net/ipv4/tcp_output.c
>>>>> +++ b/net/ipv4/tcp_output.c
>>>>> @@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
>>>>> return true;
>>>>> }
>>>>>
>>>>> +static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
>>>>> +{
>>>>> + const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
>>>>> +
>>>>> + /* No skb in the rtx queue. */
>>>>> + if (!node)
>>>>> + return true;
>>>>> +
>>>>> + /* Only one skb in rtx queue. */
>>>>> + return !node->rb_left && !node->rb_right;
>>>>> +}
>>>>> +
>>>>> /* TCP Small Queues :
>>>>> * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>>>>> * (These limits are doubled for retransmits)
>>>>> @@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock
>>>>> *sk, const struct sk_buff *skb,
>>>>> limit += extra_bytes;
>>>>> }
>>>>> if (refcount_read(&sk->sk_wmem_alloc) > limit) {
>>>>> - /* Always send skb if rtx queue is empty.
>>>>> + /* Always send skb if rtx queue is empty or has one skb.
>>>>> * No need to wait for TX completion to call us back,
>>>>> * after softirq/tasklet schedule.
>>>>> * This helps when TX completions are delayed too much.
>>>>> */
>>>>> - if (tcp_rtx_queue_empty(sk))
>>>>> + if (tcp_rtx_queue_empty_or_single_skb(sk))
>>>>> return false;
>>>>>
>>>>> set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
>>>> This patch applied on top of Linux 6.1.49, TSO on, gso_max_size 65535,
>>>> CONFIG_HZ_100=y
>>>>
>>>> root@tarragon:/boot# iperf -t 10 -i 1 -c 192.168.1.129
>>>> ------------------------------------------------------------
>>>> Client connecting to 192.168.1.129, TCP port 5001
>>>> TCP window size: 192 KByte (default)
>>>> ------------------------------------------------------------
>>>> [ 3] local 192.168.1.12 port 59714 connected with 192.168.1.129 port 5001
>>>> [ ID] Interval Transfer Bandwidth
>>>> [ 3] 0.0- 1.0 sec 11.5 MBytes 96.5 Mbits/sec
>>>> [ 3] 1.0- 2.0 sec 11.4 MBytes 95.4 Mbits/sec
>>>> [ 3] 2.0- 3.0 sec 11.1 MBytes 93.3 Mbits/sec
>>>> [ 3] 3.0- 4.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 4.0- 5.0 sec 11.1 MBytes 93.3 Mbits/sec
>>>> [ 3] 5.0- 6.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 6.0- 7.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 7.0- 8.0 sec 11.1 MBytes 93.3 Mbits/sec
>>>> [ 3] 8.0- 9.0 sec 11.4 MBytes 95.4 Mbits/sec
>>>> [ 3] 9.0-10.0 sec 11.2 MBytes 94.4 Mbits/sec
>>>> [ 3] 0.0-10.0 sec 113 MBytes 94.4 Mbits/sec
>>>>
>>>> The figures are comparable to disabling TSO -> Good
>>>>
>>>> Thanks
>>> Great. I suspect a very slow TX completion from fec then.
>>>
>>> Could you use the following bpftrace program while your iperf is running ?
>> unfortuntely there is no bpftrace and most of its dependencies on my
>> platform. I looked at some guides and it seems to have a lot of (build)
>> dependencies. On a PC this won't be a problem, but on my ARM platform
>> there is only 1 GB eMMC space left.
>>
>> Before investing a lot of time to get bpftrace running, is there an
>> alternative solution?
> No worries Stefan, I think we do not have to get precise numbers, I
> will try to provide a debug patch (since you are able to build custom
> kernels)
Great, this would be much easier for me.
>
> In the meantime, I will submit the official TCP patch, as you said it
> was helping a lot.
>
> Just to confirm, have you tried the patch on top of the latest net tree ?
Not yet (only Linux 6.1.49 stable), but I can do this today and come
back to you.
>
> Thanks.
>
>>> .bpftrace -e '
>>> k:__dev_queue_xmit {
>>> $skb = (struct sk_buff *)arg0;
>>> if ($skb->fclone == 2) {
>>> @start[$skb] = nsecs;
>>> }
>>> }
>>> k:__kfree_skb {
>>> $skb = (struct sk_buff *)arg0;
>>> if ($skb->fclone == 2 && @start[$skb]) {
>>> @tx_compl_usecs = hist((nsecs - @start[$skb])/1000);
>>> delete(@start[$skb]);
>>> }
>>> }
>>> END { clear(@start); }'
^ permalink raw reply [flat|nested] 20+ messages in thread
end of thread, other threads: [~2023-10-17 12:17 UTC | newest]
Thread overview: 20+ messages
2023-10-09 18:58 iperf performance regression since Linux 5.18 Stefan Wahren
2023-10-09 19:10 ` Eric Dumazet
2023-10-09 19:19 ` Neal Cardwell
2023-10-13 13:37 ` Stefan Wahren
2023-10-14 19:40 ` Neal Cardwell
2023-10-14 21:16 ` Dave Taht
2023-10-14 22:51 ` Eric Dumazet
2023-10-14 23:24 ` Neal Cardwell
2023-10-14 23:26 ` Stefan Wahren
2023-10-15 10:23 ` Stefan Wahren
2023-10-16 9:49 ` Eric Dumazet
2023-10-16 10:35 ` Eric Dumazet
2023-10-16 18:25 ` Stefan Wahren
2023-10-16 18:47 ` Eric Dumazet
2023-10-17 9:53 ` Stefan Wahren
2023-10-17 12:08 ` Eric Dumazet
2023-10-17 12:17 ` Stefan Wahren
2023-10-16 18:21 ` Stefan Wahren
2023-10-15 0:06 ` Stefan Wahren
2023-10-11 12:58 ` Stefan Wahren