16% regression on 10G caused by TCP small queues

netdev.vger.kernel.org archive mirror

* 16% regression on 10G caused by TCP small queues
@ 2013-10-24  2:29 Stephen Hemminger
  2013-10-24  2:37 ` Neal Cardwell
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2013-10-24  2:29 UTC (permalink / raw)
  To: Eric Dumazet, David Miller, Dave Täht; +Cc: netdev

In the course of testing routing functionality, I discovered that the single flow TCP
throughput was much worse than expected. At first, it looked like a router problem,
or maybe because one end was a FreeBSD system (which has noticeably slower TCP performance).
But reducing it down to two systems directly connected over 10G (ixgbe) found the problem.

With a single TCP flow on a 3.5 kernel, iperf reports 9.41 Gbit/sec,
which is at the link limit for TCP with timestamps etc. But in 3.6 and later the
throughput drops to 7.9 Gbit/sec, which is a regression of 16%.
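
As a rough check of these two figures, the sketch below computes the TCP
goodput ceiling on a 10G link and the size of the regression. It assumes a
standard 1500-byte MTU, IPv4, and the 12-byte padded TCP timestamp option;
none of those details are stated in the report, so treat the constants as
assumptions.

```python
# Rough single-flow TCP goodput ceiling on 10G Ethernet.
# Assumptions (not stated in the report): 1500-byte MTU, IPv4, TCP timestamps.
LINK_RATE_GBIT = 10.0
MTU = 1500
ETH_OVERHEAD = 8 + 14 + 4 + 12   # preamble/SFD + header + FCS + inter-frame gap
IP_TCP_HEADERS = 20 + 20 + 12    # IPv4 + TCP + padded timestamp option


def goodput_gbit(link_rate_gbit=LINK_RATE_GBIT, mtu=MTU):
    """Payload bytes per frame divided by bytes occupied on the wire."""
    payload = mtu - IP_TCP_HEADERS
    wire_bytes = mtu + ETH_OVERHEAD
    return link_rate_gbit * payload / wire_bytes


def regression_pct(before, after):
    """Relative throughput loss, in percent."""
    return 100.0 * (before - after) / before


print(f"goodput ceiling: {goodput_gbit():.2f} Gbit/sec")
print(f"regression: {regression_pct(9.41, 7.9):.1f}%")
```

Under those assumptions the ceiling comes out at about 9.41 Gbit/sec, which
matches the 3.5 measurement, and the drop to 7.9 Gbit/sec is about 16%.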

Doing bisect shows that the commit causing this is:

  commit 46d3ceabd8d98ed0ad10f20c595ca784e34786c5
  Author: Eric Dumazet <eric.dumazet@gmail.com>
  Date:   Wed Jul 11 05:50:31 2012 +0000

    tcp: TCP Small Queues

    This introduces TSQ (TCP Small Queues)

There are several options at this point:
  0. Ignore it. Sorry, this is not acceptable.
     People do transfer files over 10G and expect line rate!

  1. Rip it out, which adds to bufferbloat.
     This is a throughput vs latency tradeoff.

  2. Neuter it by making TCP small queues configurable and default off.
     This allows people who are willing to sacrifice performance to go ahead and
     enable it.

  3. Tweak it. Make the default queue value in kernel big enough that no loss is
     observable.

  4. Do something smarter like a dynamic TCP small queue that adapts.
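
For options 2 and 3, the knob in question is the tcp_limit_output_bytes
sysctl that shipped with TSQ. A minimal sketch for inspecting the current
value from Python follows; the /proc path is the usual location for this
sysctl, and the helper name read_tsq_limit is hypothetical.

```python
from pathlib import Path
from typing import Optional

# Usual procfs location of the TSQ sysctl (assumed; absent on pre-3.6 kernels
# and on non-Linux systems).
SYSCTL = Path("/proc/sys/net/ipv4/tcp_limit_output_bytes")


def read_tsq_limit(path: Path = SYSCTL) -> Optional[int]:
    """Return the per-flow TSQ byte limit, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


limit = read_tsq_limit()
print("tcp_limit_output_bytes:", limit if limit is not None else "not available")
```

Raising the value trades latency for throughput, which is the tradeoff
described in option 1.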

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  2:29 16% regression on 10G caused by TCP small queues Stephen Hemminger
@ 2013-10-24  2:37 ` Neal Cardwell
  2013-10-24  3:09   ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: Neal Cardwell @ 2013-10-24  2:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, David Miller, Dave Täht, Netdev

On Wed, Oct 23, 2013 at 10:29 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> In the course of testing routing functionality, I discovered that the single flow TCP
> throughput was much worse than expected. At first, it looked like a router problem,
> or maybe because one end was a FreeBSD system (which has noticeably slower TCP performance).
> But reducing it down to two systems directly connected over 10G (ixgbe) found the problem.
...
>   4. Do something smarter like a dynamic TCP small queue that adapts.

Yep, Eric made TSQ dynamic a few weeks ago, and mentioned that his
commit helps a single flow on 10Gbps link:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10

Can you please check the performance in your setup on 3.12-rc4 or newer? :-)

Thanks!

neal

---

commit c9eeec26e32e087359160406f96e0949b3cc6f10
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Sep 27 03:28:54 2013 -0700

    tcp: TSQ can use a dynamic limit

    When TCP Small Queues was added, we used a sysctl to limit the amount of
    packets queued on Qdisc/device queues for a given TCP flow.

    Problem is this limit is either too big for low rates, or too small
    for high rates.

    Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
    auto sizing, it can better control number of packets in Qdisc/device
    queues.

    New limit is two packets or at least 1 to 2 ms worth of packets.

    Low rates flows benefit from this patch by having even smaller
    number of packets in queues, allowing for faster recovery,
    better RTT estimations.

    High rates flows benefit from this patch by allowing more than 2 packets
    in flight as we had reports this was a limiting factor to reach line
    rate. [ In particular if TX completion is delayed because of coalescing
    parameters ]

    Example for a single flow on a 10Gbps link controlled by FQ/pacing

    14 packets in flight instead of 2
    ...
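
The rule described in the commit message ("two packets or at least 1 to 2 ms
worth of packets") can be sketched as follows. This is an illustration of
the description, not the kernel code: the function name, the use of a
per-packet truesize, and the right shift by 10 (dividing the byte rate by
1024, about 1 ms worth of data) are assumptions.

```python
def tsq_limit_bytes(skb_truesize: int, pacing_rate_bytes_per_sec: int) -> int:
    """Per-flow byte budget below TCP: two packets, or ~1 ms at the pacing rate.

    Illustrative only; names and the exact formula are assumptions.
    """
    two_packets = 2 * skb_truesize
    about_one_ms = pacing_rate_bytes_per_sec >> 10  # rate / 1024
    return max(two_packets, about_one_ms)


# Low-rate flow (1 Mbit/sec = 125,000 bytes/sec): the two-packet floor applies.
print(tsq_limit_bytes(skb_truesize=2048, pacing_rate_bytes_per_sec=125_000))

# High-rate flow (10 Gbit/sec = 1.25e9 bytes/sec): the rate term dominates.
print(tsq_limit_bytes(skb_truesize=2048, pacing_rate_bytes_per_sec=1_250_000_000))
```

With the hypothetical 2048-byte truesize, the low-rate flow is held to 4096
bytes while the 10G flow is allowed roughly 1.2 MB, which shows why a single
fixed sysctl value could not suit both cases.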

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  2:37 ` Neal Cardwell
@ 2013-10-24  3:09   ` Stephen Hemminger
  2013-10-24  3:38     ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Stephen Hemminger @ 2013-10-24  3:09 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Eric Dumazet, David Miller, Dave Täht, Netdev

On Wed, Oct 23, 2013 at 7:37 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Wed, Oct 23, 2013 at 10:29 PM, Stephen Hemminger
> <stephen@networkplumber.org> wrote:
>> In the course of testing routing functionality, I discovered that the single flow TCP
>> throughput was much worse than expected. At first, it looked like a router problem,
>> or maybe because one end was a FreeBSD system (which has noticeably slower TCP performance).
>> But reducing it down to two systems directly connected over 10G (ixgbe) found the problem.
> ...
>>   4. Do something smarter like a dynamic TCP small queue that adapts.
>
> Yep, Eric made TSQ dynamic a few weeks ago, and mentioned that his
> commit helps a single flow on 10Gbps link:
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c9eeec26e32e087359160406f96e0949b3cc6f10
>
> Can you please check the performance in your setup on 3.12-rc4 or newer? :-)
>
> Thanks!
>
> neal

I will check 3.12, but what about users on 3.10 which is the LTS
kernel used by most distros?

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  3:09   ` Stephen Hemminger
@ 2013-10-24  3:38     ` David Miller
  2013-10-24  4:45       ` Stephen Hemminger
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2013-10-24  3:38 UTC (permalink / raw)
  To: stephen; +Cc: ncardwell, eric.dumazet, dave.taht, netdev

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Wed, 23 Oct 2013 20:09:49 -0700

> I will check 3.12, but what about users on 3.10 which is the LTS
> kernel used by most distros?

The fix will be backported to -stable, relax Stephen.

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  3:38     ` David Miller
@ 2013-10-24  4:45       ` Stephen Hemminger
  2013-10-24  6:05         ` Eric Dumazet
                           ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Stephen Hemminger @ 2013-10-24  4:45 UTC (permalink / raw)
  To: David Miller; +Cc: ncardwell, eric.dumazet, dave.taht, netdev

On Wed, 23 Oct 2013 23:38:16 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Wed, 23 Oct 2013 20:09:49 -0700
> 
> > I will check 3.12, but what about users on 3.10 which is the LTS
> > kernel used by most distros?

3.12-rc6 gets line rate again (9.41 Gbit/sec)

> The fix will be backported to -stable, relax Stephen.

Sorry, I thought sk_pacing_rate depended on the FQ qdisc, but it is the other way around.
In that case, merging these two commits was sufficient to fix the problem,
with a minor manual fixup to tcp.h.


commit 95bd09eb27507691520d39ee1044d6ad831c1168
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Aug 27 05:46:32 2013 -0700

    tcp: TSO packets automatic sizing

commit c9eeec26e32e087359160406f96e0949b3cc6f10
Author: Eric Dumazet <edumazet@google.com>
Date:   Fri Sep 27 03:28:54 2013 -0700

    tcp: TSQ can use a dynamic limit

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  4:45       ` Stephen Hemminger
@ 2013-10-24  6:05         ` Eric Dumazet
  2013-10-24  6:10         ` Eric Dumazet
  2013-10-24  7:01         ` David Miller
  2 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-24  6:05 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, ncardwell, dave.taht, netdev

On Wed, 2013-10-23 at 21:45 -0700, Stephen Hemminger wrote:
> On Wed, 23 Oct 2013 23:38:16 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Date: Wed, 23 Oct 2013 20:09:49 -0700
> > 
> > > I will check 3.12, but what about users on 3.10 which is the LTS
> > > kernel used by most distros?
> 
> 3.12-rc6 gets line rate again (9.41 Gbit/sec)
> 
> > The fix will be backported to -stable, relax Stephen.
> 
> Sorry, I thought sk_pacing_rate depended on the FQ qdisc, but it is the other way around.
> In that case, merging these two commits was sufficient to fix the problem,
> with a minor manual fixup to tcp.h.
> 
> 
> commit 95bd09eb27507691520d39ee1044d6ad831c1168
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Tue Aug 27 05:46:32 2013 -0700
> 
>     tcp: TSO packets automatic sizing
> 
> commit c9eeec26e32e087359160406f96e0949b3cc6f10
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Fri Sep 27 03:28:54 2013 -0700
> 
>     tcp: TSQ can use a dynamic limit

Perfect, thanks for testing, Stephen!

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  4:45       ` Stephen Hemminger
  2013-10-24  6:05         ` Eric Dumazet
@ 2013-10-24  6:10         ` Eric Dumazet
  2013-10-24  6:19           ` Eric Dumazet
  2013-10-24  7:01         ` David Miller
  2 siblings, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2013-10-24  6:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, ncardwell, dave.taht, netdev

On Wed, 2013-10-23 at 21:45 -0700, Stephen Hemminger wrote:
> On Wed, 23 Oct 2013 23:38:16 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Date: Wed, 23 Oct 2013 20:09:49 -0700
> > 
> > > I will check 3.12, but what about users on 3.10 which is the LTS
> > > kernel used by most distros?
> 
> 3.12-rc6 gets line rate again (9.41 Gbit/sec)
> 
> > The fix will be backported to -stable, relax Stephen.
> 
> Sorry, I thought sk_pacing_rate depended on the FQ qdisc, but it is the other way around.
> In that case, merging these two commits was sufficient to fix the problem,
> with a minor manual fixup to tcp.h.

Btw what is the NIC you are using ?

I also had to patch the mlx4 driver because its coalescing parameters were
too big. That was before TCP Small Queues dynamic sizing, but it's worth
noting that for the initial ramp-up (when the flow's sk_pacing_rate is low
because the initial cwnd is 10), it might make a difference.

This was:

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults
    
    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.
    
    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.
    
    I suggest using 16 us instead of 128 us, allowing a finer control.
    
    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.
    
    This patch is also a BQL prereq.
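
The interaction between TX completion delay and a fixed per-flow byte limit
can be illustrated with a crude upper bound: if at most limit_bytes may sit
below TCP, and a completion can be deferred by up to the coalescing delay,
a flow cannot exceed limit_bytes per delay period in the worst case. The
131072-byte limit used here is an assumed value for tcp_limit_output_bytes,
and the helper name is hypothetical.

```python
def max_rate_gbit(limit_bytes: int, completion_delay_us: float) -> float:
    """Worst-case throughput bound: limit_bytes released once per delay period."""
    bits = limit_bytes * 8
    seconds = completion_delay_us * 1e-6
    return bits / seconds / 1e9


ASSUMED_LIMIT = 131072  # assumed tcp_limit_output_bytes value, in bytes

for delay_us in (128, 16):
    bound = max_rate_gbit(ASSUMED_LIMIT, delay_us)
    print(f"{delay_us:>3} us completion delay -> at most {bound:.1f} Gbit/sec")
```

This is only a bound: the ~4 Gbps reported in the commit message is lower
than what the model allows at 128 us. It does show the direction of the
effect, since at 16 us the bound is well above 10G line rate.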

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  6:10         ` Eric Dumazet
@ 2013-10-24  6:19           ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2013-10-24  6:19 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, ncardwell, dave.taht, netdev

On Wed, 2013-10-23 at 23:10 -0700, Eric Dumazet wrote:

> Btw what is the NIC you are using ?

Oh well, I read your mail and saw ixgbe was mentioned ;)

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  4:45       ` Stephen Hemminger
  2013-10-24  6:05         ` Eric Dumazet
  2013-10-24  6:10         ` Eric Dumazet
@ 2013-10-24  7:01         ` David Miller
  2013-10-27  4:33           ` Dave Taht
  2 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2013-10-24  7:01 UTC (permalink / raw)
  To: stephen; +Cc: ncardwell, eric.dumazet, dave.taht, netdev

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Wed, 23 Oct 2013 21:45:57 -0700

> Sorry, I thought sk_pacing_rate depended on the FQ qdisc, but it is the other way around.
> In that case, merging these two commits was sufficient to fix the problem,
> with a minor manual fixup to tcp.h.

I know, I already have a half-built tree of -stable submissions
that does exactly this.

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-24  7:01         ` David Miller
@ 2013-10-27  4:33           ` Dave Taht
  2013-10-27  5:07             ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Taht @ 2013-10-27  4:33 UTC (permalink / raw)
  To: David Miller; +Cc: stephen, ncardwell, eric.dumazet, netdev

On Thu, Oct 24, 2013 at 03:01:21AM -0400, David Miller wrote:
> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Wed, 23 Oct 2013 21:45:57 -0700
> 
> > Sorry, I thought sk_pacing_rate depended on the FQ qdisc, but it is the other way around.
> > In that case, merging these two commits was sufficient to fix the problem,
> > with a minor manual fixup to tcp.h.
> 
> I know, I already have a half-built tree of -stable submissions
> that does exactly this.

I know that the "fq" qdisc is not exactly a -stable thing, but if it's simpler
to include it rather than sort through the patch sets, I'm all for it.

* Re: 16% regression on 10G caused by TCP small queues
  2013-10-27  4:33           ` Dave Taht
@ 2013-10-27  5:07             ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2013-10-27  5:07 UTC (permalink / raw)
  To: dave.taht; +Cc: stephen, ncardwell, eric.dumazet, netdev

From: Dave Taht <dave.taht@bufferbloat.net>
Date: Sat, 26 Oct 2013 21:33:52 -0700

> On Thu, Oct 24, 2013 at 03:01:21AM -0400, David Miller wrote:
>> From: Stephen Hemminger <stephen@networkplumber.org>
>> Date: Wed, 23 Oct 2013 21:45:57 -0700
>> 
>> > Sorry, I thought sk_pacing_rate depended on the FQ qdisc, but it is the other way around.
>> > In that case, merging these two commits was sufficient to fix the problem,
>> > with a minor manual fixup to tcp.h.
>> 
>> I know, I already have a half-built tree of -stable submissions
>> that does exactly this.
> 
> I know that the "fq" qdisc is not exactly a -stable thing, but if it's simpler
> to include it rather than sort through the patch sets, I'm all for it.

I'm not including 'fq' and it's absolutely not necessary to fix this
bug.

'fq' is only incidental to this bug fix because it just so happens to
make use of sk->sk_pacing_rate.  There is no other connection between
these two things.

Thread overview: 11+ messages
2013-10-24  2:29 16% regression on 10G caused by TCP small queues Stephen Hemminger
2013-10-24  2:37 ` Neal Cardwell
2013-10-24  3:09   ` Stephen Hemminger
2013-10-24  3:38     ` David Miller
2013-10-24  4:45       ` Stephen Hemminger
2013-10-24  6:05         ` Eric Dumazet
2013-10-24  6:10         ` Eric Dumazet
2013-10-24  6:19           ` Eric Dumazet
2013-10-24  7:01         ` David Miller
2013-10-27  4:33           ` Dave Taht
2013-10-27  5:07             ` David Miller
