* performance regression on HiperSockets depending on MTU size
From: Frank Blaschka @ 2012-11-26 15:32 UTC
To: eric.dumazet; +Cc: netdev, linux-s390
Hi Eric,
since kernel 3.6 we see a massive performance regression on s390
HiperSockets devices.
HiperSockets differ from normal devices by the fact they support
large MTU sizes (up to 56K). Here are some iperf numbers to show
how the problem depends on the MTU size:
# ifconfig hsi0 mtu 1500
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 47.6 KByte (default)
------------------------------------------------------------
[ 3] local 10.42.49.1 port 55855 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 632 MBytes 530 Mbits/sec
# ifconfig hsi0 mtu 9000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 97.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.42.49.1 port 55856 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 2.26 GBytes 1.94 Gbits/sec
# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 322 KByte (default)
------------------------------------------------------------
[ 3] local 10.42.49.1 port 55857 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.3 sec 3.12 MBytes 2.53 Mbits/sec
Prior to the regression, throughput grew with the MTU size, but now it
drops to a few Mbit/s if the MTU is bigger than 15000. Interestingly,
if 2 or more connections are running in parallel, the regression is gone.
# ifconfig hsi0 mtu 32000
# iperf -c 10.42.49.2 -P2
------------------------------------------------------------
Client connecting to 10.42.49.2, TCP port 5001
TCP window size: 322 KByte (default)
------------------------------------------------------------
[ 4] local 10.42.49.1 port 55869 connected with 10.42.49.2 port 5001
[ 3] local 10.42.49.1 port 55868 connected with 10.42.49.2 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 2.19 GBytes 1.88 Gbits/sec
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 2.17 GBytes 1.87 Gbits/sec
[SUM] 0.0-10.0 sec 4.36 GBytes 3.75 Gbits/sec
I bisected the problem to the following patch:
commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed Jul 11 05:50:31 2012 +0000
tcp: TCP Small Queues
This introduce TSQ (TCP Small Queues)
TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
device queues), to reduce RTT and cwnd bias, part of the bufferbloat
problem.
Changing the sysctl net.ipv4.tcp_limit_output_bytes to a higher value
(e.g. 640000) seems to fix the problem.
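For reference, the non-persistent way to apply this at runtime is:
# sysctl -w net.ipv4.tcp_limit_output_bytes=640000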
How does the MTU influence/affect TSQ?
Why is the problem gone if there are more connections?
Do you see any drawbacks in increasing net.ipv4.tcp_limit_output_bytes?
Finally, is this expected behavior, or is there a bug related to the big
MTU? What can I do to check ...?
Thx for your help
Frank
* Re: performance regression on HiperSockets depending on MTU size
From: Eric Dumazet @ 2012-11-26 16:12 UTC
To: Frank Blaschka; +Cc: netdev, linux-s390
On Mon, 2012-11-26 at 16:32 +0100, Frank Blaschka wrote:
> Hi Eric,
>
> since kernel 3.6 we see a massive performance regression on s390
> HiperSockets devices.
>
> HiperSockets differ from normal devices by the fact they support
> large MTU sizes (up to 56K). Here are some iperf numbers to show
> how the problem depends on the MTU size:
>
> [iperf output for MTU 1500, 9000 and 32000 snipped]
>
> Prior to the regression, throughput grew with the MTU size, but now it
> drops to a few Mbit/s if the MTU is bigger than 15000. Interestingly,
> if 2 or more connections are running in parallel, the regression is gone.
>
> [iperf -P2 output snipped]
>
> I bisected the problem to the following patch:
>
> commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5 ("tcp: TCP Small Queues")
>
> Changing the sysctl net.ipv4.tcp_limit_output_bytes to a higher value
> (e.g. 640000) seems to fix the problem.
>
> How does the MTU influence/affect TSQ?
> Why is the problem gone if there are more connections?
> Do you see any drawbacks in increasing net.ipv4.tcp_limit_output_bytes?
> Finally, is this expected behavior, or is there a bug related to the big
> MTU? What can I do to check ...?
>
Hi Frank, thanks for this report.
You could tweak tcp_limit_output_bytes, but IMO the root of the problem
is in the driver itself.
For example, I had to change the mlx4 driver for the same problem: make
sure a TX packet can be "TX completed" in a short amount of time.
In the case of mlx4, the wait time was 128 us, but I suspect in your
case it's more like an infinite time or several ms.
The driver is either delaying the free of the TX skb by a fixed amount
of time, or relying on subsequent transmits to perform the TX completion.
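To see why that stalls a single flow, here is the core TSQ check from
the patch above, slightly simplified (not the verbatim code):

	/* in tcp_write_xmit(): sk_wmem_alloc tracks the bytes this socket
	 * still has in flight in the qdisc/device queues, and it only
	 * drops when the driver frees (TX completes) the skbs. A driver
	 * that defers completion indefinitely keeps the flow throttled
	 * here.
	 */
	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
		break;
	}

That is probably also why two parallel flows hide the problem: each
flow's transmits trigger the completions that unthrottle the other.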
Check this commit for an example:
commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date: Mon Nov 5 16:20:42 2012 +0000
mlx4: change TX coalescing defaults
mlx4 currently uses a too high tx coalescing setting, deferring
TX completion interrupts by up to 128 us.
With the recent skb_orphan() removal in commit 8112ec3b872,
performance of a single TCP flow is capped to ~4 Gbps, unless
we increase tcp_limit_output_bytes.
I suggest using 16 us instead of 128 us, allowing a finer control.
Performance of a single TCP flow is restored to previous levels,
while keeping TCP small queues fully enabled with default sysctl.
This patch is also a BQL prereq.
Reported-by: Vimalkumar <j.vimal@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
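On NICs whose driver exposes its coalescing knobs through ethtool, you
can inspect and lower the TX completion delay at runtime, e.g.:
# ethtool -c hsi0
# ethtool -C hsi0 tx-usecs 16
(I don't know whether the HiperSockets driver supports this; if not,
the fix has to happen in the driver itself, as it did for mlx4.)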
* Re: performance regression on HiperSockets depending on MTU size
From: Cong Wang @ 2012-11-27 6:21 UTC
To: netdev
On Mon, 26 Nov 2012 at 16:12 GMT, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Hi Frank, thanks for this report.
>
> You could tweak tcp_limit_output_bytes, but IMO the root of the problem
> is in the driver itself.
>
> For example, I had to change the mlx4 driver for the same problem: make
> sure a TX packet can be "TX completed" in a short amount of time.
>
> In the case of mlx4, the wait time was 128 us, but I suspect in your
> case it's more like an infinite time or several ms.
>
> The driver is either delaying the free of the TX skb by a fixed amount
> of time, or relying on subsequent transmits to perform the TX completion.
>
Eric,
Do you have a full list of such commits? I am trying to backport TSQ
to 2.6.32, and of course I don't want to miss these commits either.
Thanks!
* Re: performance regression on HiperSockets depending on MTU size
From: Eric Dumazet @ 2012-11-27 6:46 UTC
To: Cong Wang; +Cc: netdev
On Tue, 2012-11-27 at 06:21 +0000, Cong Wang wrote:
> Eric,
>
> Do you have a full list of such commits? I am trying to backport TSQ
> to 2.6.32, and of course I don't want to miss these commits either.
I don't think there are other known issues.
mlx4 had a 'problem' because only recently we removed the skb_orphan()
call it used to do in its start_xmit() function.
I remember David had to revert BQL on the NIU driver, but NIU does the
skb_orphan() call as well, so TSQ is basically disabled there.
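For reference, the pattern looks like this in a driver's xmit path
(a hypothetical "foo" driver, just to illustrate, not a real one):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* skb_orphan() detaches the skb from its socket and runs the
	 * destructor immediately, so the bytes stop counting against the
	 * socket's sk_wmem_alloc right away. The TSQ limit check then
	 * never triggers for this socket, no matter how late the hardware
	 * completes the packet.
	 */
	skb_orphan(skb);

	/* ... hand the skb to the hardware ... */

	return NETDEV_TX_OK;
}

So if you backport TSQ to 2.6.32, drivers doing this simply won't be
throttled: safe, but they lose the TSQ benefit.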
* Re: performance regression on HiperSockets depending on MTU size
From: Cong Wang @ 2012-11-28 5:31 UTC
To: Eric Dumazet; +Cc: Linux Kernel Network Developers
On Tue, Nov 27, 2012 at 2:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2012-11-27 at 06:21 +0000, Cong Wang wrote:
>
>> Eric,
>>
>> Do you have a full list of such commits? I am trying to backport TSQ
>> to 2.6.32, and of course I don't want to miss these commits either.
>
> I don't think there are other known issues.
>
> mlx4 had a 'problem' because only recently we removed the skb_orphan()
> call it used to do in its start_xmit() function.
>
> I remember David had to revert BQL on the NIU driver, but NIU does the
> skb_orphan() call as well, so TSQ is basically disabled there.
>
>
Good news! Thanks!