* TSQ accounting skb->truesize degrades throughput for large packets
  From: Wei Liu @ 2013-09-06 10:16 UTC
  To: Eric Dumazet; +Cc: wei.liu2, Jonathan Davies, Ian Campbell, netdev, xen-devel

Hi Eric,

I have some questions regarding TSQ and I hope you can shed some light
on this.

Our observation is that with the default TSQ limit (128K), throughput of
the Xen network driver degrades for large packets. That's because we now
only have 1 packet in the queue.

I double-checked that skb->len is indeed < 64K. Then I discovered that
TSQ actually accounts for skb->truesize, and the packets generated had
skb->truesize > 64K, which effectively prevented us from putting 2
packets in the queue.

There seems to be no way to limit skb->truesize inside the driver -- the
skb is already constructed when it comes to xen-netfront.

My questions are:

1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
   accounts skb truesize, including skb overhead. But thats OK". I don't
   quite understand why it is OK.
2) Presumably other drivers will suffer from this as well; is it
   possible to account for skb->len instead of skb->truesize?
3) If accounting skb->truesize is on purpose, does that mean we only
   need to tune that value instead of trying to fix our driver (if there
   is a way to)?

Thanks
Wei.
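As background for the question above: skb->truesize charges the full memory
footprint of the skb, not just the payload counted in skb->len. A rough
userspace sketch of the arithmetic follows; the overhead constants are
illustrative assumptions, not the exact values xen-netfront produces.

/* Illustrative only: why a GSO packet with skb->len just under 64K can
 * still have skb->truesize above 64K.  truesize counts the payload plus
 * the struct sk_buff itself, the skb_shared_info area and any allocation
 * slack in the fragments.  The overhead figures below are assumptions
 * for the sake of the example.
 */
#include <stdio.h>

int main(void)
{
        unsigned int len             = 65160;    /* GSO payload, < 64K */
        unsigned int skb_overhead    = 256;      /* ~sizeof(struct sk_buff), assumed */
        unsigned int shinfo_overhead = 320;      /* ~skb_shared_info, assumed */
        unsigned int frag_slack      = 17 * 64;  /* per-fragment slack, assumed */
        unsigned int truesize = len + skb_overhead + shinfo_overhead + frag_slack;

        printf("len=%u truesize=%u (64K=%u)\n", len, truesize, 64 * 1024);
        return 0;
}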
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-06 12:57 UTC
  To: Wei Liu; +Cc: Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> Hi Eric
>
> I have some questions regarding TSQ and I hope you can shed some light
> on this.
>
> Our observation is that with the default TSQ limit (128K), throughput
> for Xen network driver for large packets degrades. That's because we now
> only have 1 packet in queue.
>
> I double-checked that skb->len is indeed <64K. Then I discovered that
> TSQ actually accounts for skb->truesize and the packets generated had
> skb->truesize > 64K which effectively prevented us from putting 2
> packets in queue.
>
> There seems to be no way to limit skb->truesize inside driver -- the skb
> is already constructed when it comes to xen-netfront.
>

What is the skb->truesize value then? It must be huge, and it's clearly
a problem, because the TCP _receiver_ will also grow its window slower
if the packet is looped back.

> My questions are:
> 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
>    accounts skb truesize, including skb overhead. But thats OK", I
>    don't quite understand why it is OK.
> 2) presumably other drivers will suffer from this as well, is it
>    possible to account for skb->len instead of skb->truesize?

Well, I have no problem getting line rate on 20Gb with a single flow, so
other drivers have no problem.

> 3) if accounting skb->truesize is on purpose, does that mean we only
>    need to tune that value instead of trying to fix our driver (if
>    there is a way to)?

The check in TCP allows for at least two packets, unless a single skb
truesize is 128K?

	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
		break;
	}

So if an skb->truesize is 100K, this condition allows two packets before
throttling the third packet.

It's actually hard to account for skb->len, because sk_wmem_alloc
accounts for skb->truesize: I do not want to add another new atomic
field such as sk->sk_wbytes_alloc.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Wei Liu @ 2013-09-06 13:12 UTC
  To: Eric Dumazet; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, Sep 06, 2013 at 05:57:48AM -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> > Hi Eric
> >
> > I have some questions regarding TSQ and I hope you can shed some light
> > on this.
> >
> > Our observation is that with the default TSQ limit (128K), throughput
> > for Xen network driver for large packets degrades. That's because we now
> > only have 1 packet in queue.
> >
> > I double-checked that skb->len is indeed <64K. Then I discovered that
> > TSQ actually accounts for skb->truesize and the packets generated had
> > skb->truesize > 64K which effectively prevented us from putting 2
> > packets in queue.
> >
> > There seems to be no way to limit skb->truesize inside driver -- the skb
> > is already constructed when it comes to xen-netfront.
> >
>
> What is the skb->truesize value then ? It must be huge, and its clearly
> a problem, because the tcp _receiver_ will also grow its window slower,
> if packet is looped back.
>

It's ~66KB.

> > My questions are:
> > 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
> >    accounts skb truesize, including skb overhead. But thats OK", I
> >    don't quite understand why it is OK.
> > 2) presumably other drivers will suffer from this as well, is it
> >    possible to account for skb->len instead of skb->truesize?
>
> Well, I have no problem to get line rate on 20Gb with a single flow, so
> other drivers have no problem.
>

OK, good to know this.

> > 3) if accounting skb->truesize is on purpose, does that mean we only
> >    need to tune that value instead of trying to fix our driver (if
> >    there is a way to)?
>
> The check in TCP allows for two packets at least, unless a single skb
> truesize is 128K ?
>
> 	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> 		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> 		break;
> 	}
>
> So if a skb->truesize is 100K, this condition allows two packets, before
> throttling the third packet.
>

OK. I need to check why we're getting only 1 then.

Thanks for your reply.

Wei.

> Its actually hard to account for skb->len, because sk_wmem_alloc
> accounts for skb->truesize : I do not want to add another
> sk->sk_wbytes_alloc new atomic field.
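A quick check of the arithmetic in this exchange, as a self-contained
sketch (the ~66KB truesize is the figure reported above; 128KB is the
default tcp_limit_output_bytes):

/* Sketch of the TSQ check discussed above: with ~66KB of truesize per
 * skb and the default 128KB limit, the throttle should only hit the
 * third packet, i.e. two skbs can be in flight at once.
 */
#include <stdio.h>

int main(void)
{
        const unsigned int limit    = 128 * 1024; /* sysctl_tcp_limit_output_bytes */
        const unsigned int truesize = 66 * 1024;  /* ~66KB, as reported above */
        unsigned int wmem_alloc = 0;              /* models sk->sk_wmem_alloc */
        unsigned int queued = 0;

        /* tcp_write_xmit() performs the check before queueing each skb */
        while (wmem_alloc < limit) {
                wmem_alloc += truesize;
                queued++;
        }
        printf("skbs queued before TSQ throttles: %u\n", queued); /* prints 2 */
        return 0;
}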
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Zoltan Kiss @ 2013-09-06 16:36 UTC
  To: Eric Dumazet; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On 06/09/13 13:57, Eric Dumazet wrote:
> Well, I have no problem to get line rate on 20Gb with a single flow, so
> other drivers have no problem.

I've made some tests on bare metal:
Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
ixgbe (TSO, GSO on), iperf 2.0.5

Transmitting packets toward the remote end (so running iperf -c on this
host) can reach 8.3 Gbps with the default 128k tcp_limit_output_bytes.
When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
to 9.4 Gbps. Iperf CPU usage also jumped a few percent, from ~36 to ~40%
(the softirq percentage in top also increased from ~3 to ~5%).

So I guess it would be good to revisit the default value of this
setting. What HW did you use, Eric, for your 20Gb results?

Regards,

Zoli
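For completeness, a minimal sketch of the tweak described above: raising
tcp_limit_output_bytes at runtime by writing its procfs entry. The value
is the one quoted in the message; error handling is kept minimal and root
privileges are assumed.

/* Minimal sketch: raise net.ipv4.tcp_limit_output_bytes by writing its
 * procfs entry, as in the experiment above.  Requires root; the sysctl
 * exists on kernels that have TSQ (v3.6+).
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        const char *path = "/proc/sys/net/ipv4/tcp_limit_output_bytes";
        unsigned int bytes = 128 * 1024 + 434;  /* 131506, the value tested above */
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return EXIT_FAILURE;
        }
        fprintf(f, "%u\n", bytes);
        fclose(f);
        return EXIT_SUCCESS;
}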
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-06 16:56 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> On 06/09/13 13:57, Eric Dumazet wrote:
> > Well, I have no problem to get line rate on 20Gb with a single flow, so
> > other drivers have no problem.
> I've made some tests on bare metal:
> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32 bit kernel with 3.17.3
> ixgbe (TSO, GSO on), iperf 2.0.5
> Transmitting packets toward the remote end (so running iperf -c on this
> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
> When I increased this to 131.506 (128k + 434 bytes) suddenly it jumped
> to 9.4 Gbps. Iperf CPU usage also jumped a few percent from ~36 to ~40%
> (softint percentage in top also increased from ~3 to ~5%)

Typical tradeoff between latency and throughput.

If you favor throughput, then you can increase tcp_limit_output_bytes.

The default is quite reasonable IMHO.

> So I guess it would be good to revisit the default value of this
> setting. What hw you used Eric for your 20Gb results?

Mellanox CX-3

Make sure your NIC doesn't hold TX packets in the TX ring too long before
signaling an interrupt for TX completion.

For example I had to patch mellanox:

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults

    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.

    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.

    I suggest using 16 us instead of 128 us, allowing a finer control.

    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.

    This patch is also a BQL prereq.

    Reported-by: Vimalkumar <j.vimal@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yevgeny Petrilin <yevgenyp@mellanox.com>
    Cc: Or Gerlitz <ogerlitz@mellanox.com>
    Acked-by: Amir Vadai <amirv@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
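A small sketch of how the TX interrupt coalescing Eric refers to can be
inspected from userspace through the ethtool ioctl. "eth0" is a
placeholder interface name; tx_coalesce_usecs is what ethtool reports as
tx-usecs, the parameter the mlx4 commit above lowers to 16us.

/* Sketch: read a NIC's interrupt coalescing settings via the ethtool
 * ioctl, to see how long TX completions may be deferred.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
        struct ethtool_coalesce ecoal = { .cmd = ETHTOOL_GCOALESCE };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0) {
                perror("socket");
                return 1;
        }
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&ecoal;

        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
                perror("SIOCETHTOOL");
                close(fd);
                return 1;
        }
        printf("tx-usecs: %u, tx-frames: %u\n",
               ecoal.tx_coalesce_usecs, ecoal.tx_max_coalesced_frames);
        close(fd);
        return 0;
}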
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Jason Wang @ 2013-09-09 9:27 UTC
  To: Eric Dumazet
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On 09/07/2013 12:56 AM, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>> On 06/09/13 13:57, Eric Dumazet wrote:
>>> Well, I have no problem to get line rate on 20Gb with a single flow, so
>>> other drivers have no problem.
>> I've made some tests on bare metal:
>> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32 bit kernel with 3.17.3
>> ixgbe (TSO, GSO on), iperf 2.0.5
>> Transmitting packets toward the remote end (so running iperf -c on this
>> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
>> When I increased this to 131.506 (128k + 434 bytes) suddenly it jumped
>> to 9.4 Gbps. Iperf CPU usage also jumped a few percent from ~36 to ~40%
>> (softint percentage in top also increased from ~3 to ~5%)
> Typical tradeoff between latency and throughput
>
> If you favor throughput, then you can increase tcp_limit_output_bytes
>
> The default is quite reasonable IMHO.
>
>> So I guess it would be good to revisit the default value of this
>> setting. What hw you used Eric for your 20Gb results?
> Mellanox CX-3
>
> Make sure your NIC doesn't hold TX packets in TX ring too long before
> signaling an interrupt for TX completion.

Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
packets in the device accurately, and it also can't do BQL. Does this
mean TSQ should be disabled for virtio-net?
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-09 13:47 UTC
  To: Jason Wang
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:

> Virtio-net orphan the skb in .ndo_start_xmit() so TSQ can not throttle
> packets in device accurately, and it also can't do BQL. Does this means
> TSQ should be disabled for virtio-net?
>

If skbs are orphaned, there is no way TSQ can work at all.

It is already disabled, so why do you want to disable it?
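Some context on why orphaning defeats TSQ: skb_orphan() immediately runs
the skb's destructor, which for TCP skbs is tcp_wfree(), the callback TSQ
relies on to learn when bytes have left the qdisc/device queues. A
simplified paraphrase of the helper follows (not a verbatim copy of the
kernel source).

/* Simplified paraphrase of skb_orphan().  Calling it from
 * .ndo_start_xmit() runs the destructor right away; for TCP-built skbs
 * that destructor is tcp_wfree(), which uncharges the skb's truesize
 * from sk->sk_wmem_alloc.  After that, the TSQ check in tcp_write_xmit()
 * no longer sees those bytes as outstanding, so it cannot throttle on
 * what sits in the device.
 */
#include <linux/skbuff.h>

static inline void skb_orphan_sketch(struct sk_buff *skb)
{
        if (skb->destructor)
                skb->destructor(skb);   /* tcp_wfree(): drop the wmem charge */
        skb->destructor = NULL;
        skb->sk = NULL;
}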
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Jason Wang @ 2013-09-10 7:45 UTC
  To: Eric Dumazet
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On 09/09/2013 09:47 PM, Eric Dumazet wrote:
> On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:
>
>> Virtio-net orphan the skb in .ndo_start_xmit() so TSQ can not throttle
>> packets in device accurately, and it also can't do BQL. Does this means
>> TSQ should be disabled for virtio-net?
>>
> If skb are orphaned, there is no way TSQ can work at all.

For example, virtio-net will stop the tx queue when it finds the tx
queue may be full, and re-enable it when some packets have been sent. In
this case, TSQ works and throttles the total bytes queued in the qdisc.
This usually happens during heavy network load, such as two sessions of
netperf.

>
> It is already disabled, so why do you want to disable it ?
>

We noticed a regression, and bisection shows it was introduced by TSQ.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-10 12:35 UTC
  To: Jason Wang
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On Tue, 2013-09-10 at 15:45 +0800, Jason Wang wrote:

> For example, virtio-net will stop the tx queue when it finds the tx
> queue may full and enable the queue when some packets were sent. In this
> case, tsq works and throttles the total bytes queued in qdisc. This
> usually happen during heavy network load such as two sessions of netperf.

You told me skbs were _orphaned_.

This automatically _disables_ TSQ after packets leave the Qdisc.

So you have a problem, because your skb orphaning only takes effect when
packets leave the Qdisc.

If you can't afford sockets being throttled, make sure you have no Qdisc!

> We notice a regression, and bisect shows it was introduced by TSQ.

You do realize TSQ is a balance between throughput and latency?

In the case of TSQ, it was very clear that limiting the amount of
outstanding bytes in queues could have an impact on bandwidth.

Pushing megabytes of TCP packets with identical TCP timestamps is bad,
because it prevents us from doing delay-based congestion control, and a
single flow could fill the Qdisc with a thousand packets.
(Self-induced delays; see the BufferBloat discussions.)

One known problem in the TCP stack is that sendmsg() locks the socket for
the duration of the call. sendpage() does not have this problem.

tcp_tsq_handler() is deferred if tcp_tasklet_func() finds a locked
socket. The owner of the socket will call tcp_tsq_handler() when the
socket is released.

So if you use sendmsg() with large buffers, or if copying in data from
userland involves page faults, that may explain why you need a larger
number of in-flight bytes to sustain a given throughput.

You could take a look at commit c9bee3b7fdecb0c1d070c ("tcp:
TCP_NOTSENT_LOWAT socket option"), and play with
/proc/sys/net/ipv4/tcp_notsent_lowat, to force sendmsg() to release the
socket lock every few hundred kbytes.
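A sketch of the per-socket variant of the knob Eric points at. The
TCP_NOTSENT_LOWAT option needs a kernel carrying the commit referenced
above (v3.12+); older kernels return an error, and the 128KB value here
is only an example.

/* Sketch: cap the amount of not-yet-sent data buffered per socket.
 * The system-wide equivalent is /proc/sys/net/ipv4/tcp_notsent_lowat,
 * mentioned in the message above.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25    /* value from include/uapi/linux/tcp.h */
#endif

int main(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int lowat = 128 * 1024;  /* example value */

        if (fd < 0) {
                perror("socket");
                return 1;
        }
        if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                       &lowat, sizeof(lowat)) < 0)
                perror("setsockopt(TCP_NOTSENT_LOWAT)");
        /* ... connect() and send as usual ... */
        close(fd);
        return 0;
}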
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-06 17:00 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:

> So I guess it would be good to revisit the default value of this
> setting.

If ixgbe requires 3 TSO packets in the TX ring to get line rate, you also
can tweak dev->gso_max_size from 65535 to 64000.
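A sketch of where such a tweak would sit in a driver. my_driver_tune_gso()
is a hypothetical hook added for illustration; netif_set_gso_max_size() is
the usual in-kernel helper behind dev->gso_max_size.

/* Sketch (kernel code): lower the advertised GSO size so the stack
 * builds slightly smaller TSO packets, letting more of them fit under a
 * given TSQ byte budget.
 */
#include <linux/netdevice.h>

static void my_driver_tune_gso(struct net_device *dev)
{
        /* down from the 65535 default mentioned above */
        netif_set_gso_max_size(dev, 64000);
}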
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-07 17:21 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>
> > So I guess it would be good to revisit the default value of this
> > setting.
>
> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> can tweak dev->gso_max_size from 65535 to 64000.

Another idea would be to no longer use tcp_limit_output_bytes but

	max(sk_pacing_rate / 1000, 2*MSS)

This means that the number of packets in FQ would be limited to the
equivalent of 1ms, so TCP could have a faster response to packet losses:
retransmitted packets would not have to wait for prior packets to be
drained from FQ.

For an 8Gbps flow (1Gbyte/s), sk_pacing_rate would be 2Gbyte/s; this
would translate to ~2 Mbytes in the Qdisc/TX ring.

sk_pacing_rate was introduced in linux-3.12, but could be backported
easily.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Zoltan Kiss @ 2013-09-09 21:41 UTC
  To: Eric Dumazet; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On 07/09/13 18:21, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
>> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>>
>>> So I guess it would be good to revisit the default value of this
>>> setting.
>>
>> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
>> can tweak dev->gso_max_size from 65535 to 64000.
>
> Another idea would be to no longer use tcp_limit_output_bytes but
>
> max(sk_pacing_rate / 1000, 2*MSS)

I've tried this on a freshly updated upstream, and it solved my problem
on ixgbe:

-	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {

Now I can get proper line rate. Btw. I've tried to decrease
dev->gso_max_size to 60K or 32K; both were ineffective.

Regards,

Zoli
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-09 21:56 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Mon, 2013-09-09 at 22:41 +0100, Zoltan Kiss wrote:
> On 07/09/13 18:21, Eric Dumazet wrote:
> > On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> >> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> >>
> >>> So I guess it would be good to revisit the default value of this
> >>> setting.
> >>
> >> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> >> can tweak dev->gso_max_size from 65535 to 64000.
> >
> > Another idea would be to no longer use tcp_limit_output_bytes but
> >
> > max(sk_pacing_rate / 1000, 2*MSS)
>
> I've tried this on a freshly updated upstream, and it solved my problem
> on ixgbe:
>
> -	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> +	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {
>
> Now I can get proper line rate. Btw. I've tried to decrease
> dev->gso_max_size to 60K or 32K, both was ineffective.

Yeah, my own test was more like the following:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7c83cb8..07dc77a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1872,7 +1872,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* TSQ : sk_wmem_alloc accounts skb truesize,
 		 * including skb overhead. But thats OK.
 		 */
-		if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+		if (atomic_read(&sk->sk_wmem_alloc) >= max(2 * mss_now,
+							   sk->sk_pacing_rate >> 8)) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
 			break;
 		}

Note that it also seems to make Hystart happier.

I will send patches when all tests are green.
[parent not found: <loom.20130921T045654-573@post.gmane.org>]
[parent not found: <20130921150327.GA9078@zion.uk.xensource.com>]
* Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
  From: Cong Wang @ 2013-09-22 2:36 UTC
  To: Wei Liu; +Cc: xen-devel, Linux Kernel Network Developers, Eric Dumazet

On Sat, Sep 21, 2013 at 11:03 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Sat, Sep 21, 2013 at 03:00:26AM +0000, Cong Wang wrote:
>> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
>>
>> >
>> > Yeah, my own test was more like the following
>> >
>> ...
>> >
>> > Note that it also seems to make Hystart happier.
>> >
>> > I will send patches when all tests are green.
>> >
>>
>> How is this going? I don't see any patch posted to netdev.
>>
>
> I'm afraid you forgot to CC any relevant people in this email. :-)
>

I was replying via newsgroup, not mailing list. :)

Anyway, adding Eric and netdev now.
* Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-22 14:58 UTC
  To: Cong Wang; +Cc: Wei Liu, xen-devel, Linux Kernel Network Developers

On Sun, 2013-09-22 at 10:36 +0800, Cong Wang wrote:
>
> I was replying via newsgroup, not mailing list. :)
>
> Anyway, adding Eric and netdev now.

Yes, don't worry, this will be done on Monday or Tuesday.

I am still in New Orleans after LPC 2013.
* [PATCH] tcp: TSQ can use a dynamic limit
  From: Eric Dumazet @ 2013-09-27 10:28 UTC
  To: Cong Wang, David Miller
  Cc: Wei Liu, Linux Kernel Network Developers, Yuchung Cheng, Neal Cardwell

From: Eric Dumazet <edumazet@google.com>

When TCP Small Queues was added, we used a sysctl to limit amount of
packets queues on Qdisc/device queues for a given TCP flow.

Problem is this limit is either too big for low rates, or too small
for high rates.

Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
auto sizing, it can better control number of packets in Qdisc/device
queues.

New limit is two packets or at least 1 to 2 ms worth of packets.

Low rates flows benefit from this patch by having even smaller
number of packets in queues, allowing for faster recovery,
better RTT estimations.

High rates flows benefit from this patch by allowing more than 2 packets
in flight as we had reports this was a limiting factor to reach line
rate. [ In particular if TX completion is delayed because of coalescing
parameters ]

Example for a single flow on 10Gbp link controlled by FQ/pacing

14 packets in flight instead of 2

$ tc -s -d qd
qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
 buckets 1024 quantum 3028 initial_quantum 15140
 Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476)
 rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
  2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
  2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit

Note that sk_pacing_rate is currently set to twice the actual rate, but
this might be refined in the future when a flow is in congestion
avoidance.

Additional change : skb->destructor should be set to tcp_wfree().

A future patch (for linux 3.13+) might remove tcp_limit_output_bytes

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_output.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7c83cb8..c20e406 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -895,8 +895,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	skb_orphan(skb);
 	skb->sk = sk;
-	skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
-			  tcp_wfree : sock_wfree;
+	skb->destructor = tcp_wfree;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 
 	/* Build TCP header and checksum it. */
@@ -1840,7 +1839,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
-
 		tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
 		BUG_ON(!tso_segs);
 
@@ -1869,13 +1867,20 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
-		/* TSQ : sk_wmem_alloc accounts skb truesize,
-		 * including skb overhead. But thats OK.
+		/* TCP Small Queues :
+		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+		 * This allows for :
+		 *  - better RTT estimation and ACK scheduling
+		 *  - faster recovery
+		 *  - high rates
 		 */
-		if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+		limit = max(skb->truesize, sk->sk_pacing_rate >> 10);
+
+		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
 			break;
 		}
+
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
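To make the new limit concrete, a small userspace sketch of the
arithmetic; the 10Gb/s figures are example values, and sk_pacing_rate is
taken as roughly twice the actual rate, as the changelog notes.

/* Illustrative arithmetic for the new limit:
 *     limit = max(skb->truesize, sk->sk_pacing_rate >> 10)
 * With sk_pacing_rate at ~2x the actual rate, pacing_rate >> 10 is about
 * rate/512, i.e. ~2 ms of data at the actual rate -- hence "two packets
 * or at least 1 to 2 ms".  The max() with skb->truesize keeps roughly
 * two packets queueable for slow flows, since one queued skb never
 * exceeds the limit.  Example values, not measurements.
 */
#include <stdio.h>

int main(void)
{
        unsigned long rate = 1250000000UL;     /* ~10Gb/s of payload, in bytes/s */
        unsigned long pacing_rate = 2 * rate;  /* sk_pacing_rate ~ 2x actual rate */
        unsigned long limit = pacing_rate >> 10;

        printf("limit = %lu bytes (~%.2f ms at %lu B/s, ~%lu 64KB TSO skbs)\n",
               limit, 1000.0 * limit / rate, rate, limit / 65536);
        return 0;
}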
* Re: [PATCH] tcp: TSQ can use a dynamic limit
  From: Neal Cardwell @ 2013-09-27 15:08 UTC
  To: Eric Dumazet
  Cc: Cong Wang, David Miller, Wei Liu, Linux Kernel Network Developers, Yuchung Cheng

On Fri, Sep 27, 2013 at 6:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> When TCP Small Queues was added, we used a sysctl to limit amount of
> packets queues on Qdisc/device queues for a given TCP flow.
>
> Problem is this limit is either too big for low rates, or too small
> for high rates.
>
> Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
> auto sizing, it can better control number of packets in Qdisc/device
> queues.
>
> New limit is two packets or at least 1 to 2 ms worth of packets.
>
> Low rates flows benefit from this patch by having even smaller
> number of packets in queues, allowing for faster recovery,
> better RTT estimations.
>
> High rates flows benefit from this patch by allowing more than 2 packets
> in flight as we had reports this was a limiting factor to reach line
> rate. [ In particular if TX completion is delayed because of coalescing
> parameters ]
>
> Example for a single flow on 10Gbp link controlled by FQ/pacing
>
> 14 packets in flight instead of 2
>
> $ tc -s -d qd
> qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
>  buckets 1024 quantum 3028 initial_quantum 15140
>  Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476)
>  rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
>   2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
>   2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
>
> Note that sk_pacing_rate is currently set to twice the actual rate, but
> this might be refined in the future when a flow is in congestion
> avoidance.
>
> Additional change : skb->destructor should be set to tcp_wfree().
>
> A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal
* Re: [PATCH] tcp: TSQ can use a dynamic limit
  From: Cong Wang @ 2013-09-29 15:41 UTC
  To: Eric Dumazet
  Cc: David Miller, Wei Liu, Linux Kernel Network Developers, Yuchung Cheng, Neal Cardwell

On Fri, Sep 27, 2013 at 3:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Additional change : skb->destructor should be set to tcp_wfree().
>
> A future patch (for linux 3.13+) might remove tcp_limit_output_bytes

Yeah, now it is only used in tcp_xmit_size_goal()...
* Re: [PATCH] tcp: TSQ can use a dynamic limit
  From: David Miller @ 2013-10-01 3:52 UTC
  To: eric.dumazet; +Cc: xiyou.wangcong, wei.liu2, netdev, ycheng, ncardwell

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 27 Sep 2013 03:28:54 -0700

> From: Eric Dumazet <edumazet@google.com>
>
> When TCP Small Queues was added, we used a sysctl to limit amount of
> packets queues on Qdisc/device queues for a given TCP flow.
>
> Problem is this limit is either too big for low rates, or too small
> for high rates.
>
> Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
> auto sizing, it can better control number of packets in Qdisc/device
> queues.
>
> New limit is two packets or at least 1 to 2 ms worth of packets.
>
> Low rates flows benefit from this patch by having even smaller
> number of packets in queues, allowing for faster recovery,
> better RTT estimations.
>
> High rates flows benefit from this patch by allowing more than 2 packets
> in flight as we had reports this was a limiting factor to reach line
> rate. [ In particular if TX completion is delayed because of coalescing
> parameters ]
>
> Example for a single flow on 10Gbp link controlled by FQ/pacing
>
> 14 packets in flight instead of 2
>
> $ tc -s -d qd
> qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
>  buckets 1024 quantum 3028 initial_quantum 15140
>  Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476)
>  rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
>   2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
>   2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
>
> Note that sk_pacing_rate is currently set to twice the actual rate, but
> this might be refined in the future when a flow is in congestion
> avoidance.
>
> Additional change : skb->destructor should be set to tcp_wfree().
>
> A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Cong Wang @ 2013-09-09 5:28 UTC
  To: netdev

On Fri, 06 Sep 2013 at 17:00 GMT, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>
>> So I guess it would be good to revisit the default value of this
>> setting.
>
> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> can tweak dev->gso_max_size from 65535 to 64000.
>

We observe a similar regression on the ixgbe driver, and also with the
virtio_net driver.