* TSQ accounting skb->truesize degrades throughput for large packets
  From: Wei Liu @ 2013-09-06 10:16 UTC
  To: Eric Dumazet; +Cc: wei.liu2, Jonathan Davies, Ian Campbell, netdev, xen-devel

Hi Eric,

I have some questions regarding TSQ and I hope you can shed some light
on this.

Our observation is that with the default TSQ limit (128K), throughput of
the Xen network driver degrades for large packets. That's because we now
only have 1 packet in the queue.

I double-checked that skb->len is indeed < 64K. Then I discovered that
TSQ actually accounts for skb->truesize, and the packets generated had
skb->truesize > 64K, which effectively prevented us from putting 2
packets in the queue.

There seems to be no way to limit skb->truesize inside the driver -- the
skb is already constructed when it comes to xen-netfront.

My questions are:

1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
   accounts skb truesize, including skb overhead. But thats OK". I don't
   quite understand why it is OK.
2) Presumably other drivers will suffer from this as well; is it
   possible to account for skb->len instead of skb->truesize?
3) If accounting skb->truesize is on purpose, does that mean we only
   need to tune that value instead of trying to fix our driver (if there
   is a way to)?

Thanks
Wei.
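As background for the question above: skb->truesize charges the full memory
footprint of the skb, not just the payload counted in skb->len. A rough
userspace sketch of the arithmetic follows; the overhead constants are
illustrative assumptions, not the exact values xen-netfront produces.

/* Illustrative only: why a GSO packet with skb->len just under 64K can
 * still have skb->truesize above 64K.  truesize counts the payload plus
 * the struct sk_buff itself, the skb_shared_info area and any allocation
 * slack in the fragments.  The overhead figures below are assumptions
 * for the sake of the example.
 */
#include <stdio.h>

int main(void)
{
        unsigned int len             = 65160;    /* GSO payload, < 64K */
        unsigned int skb_overhead    = 256;      /* ~sizeof(struct sk_buff), assumed */
        unsigned int shinfo_overhead = 320;      /* ~skb_shared_info, assumed */
        unsigned int frag_slack      = 17 * 64;  /* per-fragment slack, assumed */
        unsigned int truesize = len + skb_overhead + shinfo_overhead + frag_slack;

        printf("len=%u truesize=%u (64K=%u)\n", len, truesize, 64 * 1024);
        return 0;
}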
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-06 12:57 UTC
  To: Wei Liu; +Cc: Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> Hi Eric
>
> I have some questions regarding TSQ and I hope you can shed some light
> on this.
>
> Our observation is that with the default TSQ limit (128K), throughput
> for Xen network driver for large packets degrades. That's because we now
> only have 1 packet in queue.
>
> I double-checked that skb->len is indeed <64K. Then I discovered that
> TSQ actually accounts for skb->truesize and the packets generated had
> skb->truesize > 64K which effectively prevented us from putting 2
> packets in queue.
>
> There seems to be no way to limit skb->truesize inside driver -- the skb
> is already constructed when it comes to xen-netfront.
>

What is the skb->truesize value then? It must be huge, and it's clearly
a problem, because the TCP _receiver_ will also grow its window slower
if the packet is looped back.

> My questions are:
> 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
>    accounts skb truesize, including skb overhead. But thats OK", I
>    don't quite understand why it is OK.
> 2) presumably other drivers will suffer from this as well, is it
>    possible to account for skb->len instead of skb->truesize?

Well, I have no problem getting line rate on 20Gb with a single flow, so
other drivers have no problem.

> 3) if accounting skb->truesize is on purpose, does that mean we only
>    need to tune that value instead of trying to fix our driver (if
>    there is a way to)?

The check in TCP allows for at least two packets, unless a single skb
truesize is 128K?

	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
		break;
	}

So if an skb->truesize is 100K, this condition allows two packets before
throttling the third packet.

It's actually hard to account for skb->len, because sk_wmem_alloc
accounts for skb->truesize: I do not want to add another new atomic
field such as sk->sk_wbytes_alloc.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Wei Liu @ 2013-09-06 13:12 UTC
  To: Eric Dumazet; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, Sep 06, 2013 at 05:57:48AM -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> > Hi Eric
> >
> > I have some questions regarding TSQ and I hope you can shed some light
> > on this.
> >
> > Our observation is that with the default TSQ limit (128K), throughput
> > for Xen network driver for large packets degrades. That's because we now
> > only have 1 packet in queue.
> >
> > I double-checked that skb->len is indeed <64K. Then I discovered that
> > TSQ actually accounts for skb->truesize and the packets generated had
> > skb->truesize > 64K which effectively prevented us from putting 2
> > packets in queue.
> >
> > There seems to be no way to limit skb->truesize inside driver -- the skb
> > is already constructed when it comes to xen-netfront.
> >
>
> What is the skb->truesize value then ? It must be huge, and its clearly
> a problem, because the tcp _receiver_ will also grow its window slower,
> if packet is looped back.
>

It's ~66KB.

> > My questions are:
> > 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
> >    accounts skb truesize, including skb overhead. But thats OK", I
> >    don't quite understand why it is OK.
> > 2) presumably other drivers will suffer from this as well, is it
> >    possible to account for skb->len instead of skb->truesize?
>
> Well, I have no problem to get line rate on 20Gb with a single flow, so
> other drivers have no problem.
>

OK, good to know this.

> > 3) if accounting skb->truesize is on purpose, does that mean we only
> >    need to tune that value instead of trying to fix our driver (if
> >    there is a way to)?
>
> The check in TCP allows for two packets at least, unless a single skb
> truesize is 128K ?
>
> 	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> 		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> 		break;
> 	}
>
> So if a skb->truesize is 100K, this condition allows two packets, before
> throttling the third packet.
>

OK. I need to check why we're getting only 1 then.

Thanks for your reply.

Wei.

> Its actually hard to account for skb->len, because sk_wmem_alloc
> accounts for skb->truesize : I do not want to add another
> sk->sk_wbytes_alloc new atomic field.
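A quick check of the arithmetic in this exchange, as a self-contained
sketch (the ~66KB truesize is the figure reported above; 128KB is the
default tcp_limit_output_bytes):

/* Sketch of the TSQ check discussed above: with ~66KB of truesize per
 * skb and the default 128KB limit, the throttle should only hit the
 * third packet, i.e. two skbs can be in flight at once.
 */
#include <stdio.h>

int main(void)
{
        const unsigned int limit    = 128 * 1024; /* sysctl_tcp_limit_output_bytes */
        const unsigned int truesize = 66 * 1024;  /* ~66KB, as reported above */
        unsigned int wmem_alloc = 0;              /* models sk->sk_wmem_alloc */
        unsigned int queued = 0;

        /* tcp_write_xmit() performs the check before queueing each skb */
        while (wmem_alloc < limit) {
                wmem_alloc += truesize;
                queued++;
        }
        printf("skbs queued before TSQ throttles: %u\n", queued); /* prints 2 */
        return 0;
}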
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Zoltan Kiss @ 2013-09-06 16:36 UTC
  To: Eric Dumazet; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On 06/09/13 13:57, Eric Dumazet wrote:
> Well, I have no problem to get line rate on 20Gb with a single flow, so
> other drivers have no problem.

I've made some tests on bare metal:
Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
ixgbe (TSO, GSO on), iperf 2.0.5

Transmitting packets toward the remote end (so running iperf -c on this
host) can reach 8.3 Gbps with the default 128k tcp_limit_output_bytes.
When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
to 9.4 Gbps. Iperf CPU usage also jumped a few percent, from ~36 to ~40%
(the softirq percentage in top also increased from ~3 to ~5%).

So I guess it would be good to revisit the default value of this
setting. What HW did you use, Eric, for your 20Gb results?

Regards,

Zoli
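For completeness, a minimal sketch of the tweak described above: raising
tcp_limit_output_bytes at runtime by writing its procfs entry. The value
is the one quoted in the message; error handling is kept minimal and root
privileges are assumed.

/* Minimal sketch: raise net.ipv4.tcp_limit_output_bytes by writing its
 * procfs entry, as in the experiment above.  Requires root; the sysctl
 * exists on kernels that have TSQ (v3.6+).
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        const char *path = "/proc/sys/net/ipv4/tcp_limit_output_bytes";
        unsigned int bytes = 128 * 1024 + 434;  /* 131506, the value tested above */
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return EXIT_FAILURE;
        }
        fprintf(f, "%u\n", bytes);
        fclose(f);
        return EXIT_SUCCESS;
}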
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-06 16:56 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> On 06/09/13 13:57, Eric Dumazet wrote:
> > Well, I have no problem to get line rate on 20Gb with a single flow, so
> > other drivers have no problem.
> I've made some tests on bare metal:
> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32 bit kernel with 3.17.3
> ixgbe (TSO, GSO on), iperf 2.0.5
> Transmitting packets toward the remote end (so running iperf -c on this
> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
> When I increased this to 131.506 (128k + 434 bytes) suddenly it jumped
> to 9.4 Gbps. Iperf CPU usage also jumped a few percent from ~36 to ~40%
> (softint percentage in top also increased from ~3 to ~5%)

Typical tradeoff between latency and throughput.

If you favor throughput, then you can increase tcp_limit_output_bytes.

The default is quite reasonable IMHO.

> So I guess it would be good to revisit the default value of this
> setting. What hw you used Eric for your 20Gb results?

Mellanox CX-3

Make sure your NIC doesn't hold TX packets in the TX ring too long before
signaling an interrupt for TX completion.

For example I had to patch mellanox:

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults

    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.

    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.

    I suggest using 16 us instead of 128 us, allowing a finer control.

    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.

    This patch is also a BQL prereq.

    Reported-by: Vimalkumar <j.vimal@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yevgeny Petrilin <yevgenyp@mellanox.com>
    Cc: Or Gerlitz <ogerlitz@mellanox.com>
    Acked-by: Amir Vadai <amirv@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
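A small sketch of how the TX interrupt coalescing Eric refers to can be
inspected from userspace through the ethtool ioctl. "eth0" is a
placeholder interface name; tx_coalesce_usecs is what ethtool reports as
tx-usecs, the parameter the mlx4 commit above lowers to 16us.

/* Sketch: read a NIC's interrupt coalescing settings via the ethtool
 * ioctl, to see how long TX completions may be deferred.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
        struct ethtool_coalesce ecoal = { .cmd = ETHTOOL_GCOALESCE };
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0) {
                perror("socket");
                return 1;
        }
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&ecoal;

        if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) {
                perror("SIOCETHTOOL");
                close(fd);
                return 1;
        }
        printf("tx-usecs: %u, tx-frames: %u\n",
               ecoal.tx_coalesce_usecs, ecoal.tx_max_coalesced_frames);
        close(fd);
        return 0;
}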
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Jason Wang @ 2013-09-09 9:27 UTC
  To: Eric Dumazet
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On 09/07/2013 12:56 AM, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>> On 06/09/13 13:57, Eric Dumazet wrote:
>>> Well, I have no problem to get line rate on 20Gb with a single flow, so
>>> other drivers have no problem.
>> I've made some tests on bare metal:
>> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32 bit kernel with 3.17.3
>> ixgbe (TSO, GSO on), iperf 2.0.5
>> Transmitting packets toward the remote end (so running iperf -c on this
>> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
>> When I increased this to 131.506 (128k + 434 bytes) suddenly it jumped
>> to 9.4 Gbps. Iperf CPU usage also jumped a few percent from ~36 to ~40%
>> (softint percentage in top also increased from ~3 to ~5%)
> Typical tradeoff between latency and throughput
>
> If you favor throughput, then you can increase tcp_limit_output_bytes
>
> The default is quite reasonable IMHO.
>
>> So I guess it would be good to revisit the default value of this
>> setting. What hw you used Eric for your 20Gb results?
> Mellanox CX-3
>
> Make sure your NIC doesn't hold TX packets in TX ring too long before
> signaling an interrupt for TX completion.

Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
packets in the device accurately, and it also can't do BQL. Does this
mean TSQ should be disabled for virtio-net?
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-09 13:47 UTC
  To: Jason Wang
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:

> Virtio-net orphan the skb in .ndo_start_xmit() so TSQ can not throttle
> packets in device accurately, and it also can't do BQL. Does this means
> TSQ should be disabled for virtio-net?
>

If skbs are orphaned, there is no way TSQ can work at all.

It is already disabled, so why do you want to disable it?
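Some context on why orphaning defeats TSQ: skb_orphan() immediately runs
the skb's destructor, which for TCP skbs is tcp_wfree(), the callback TSQ
relies on to learn when bytes have left the qdisc/device queues. A
simplified paraphrase of the helper follows (not a verbatim copy of the
kernel source).

/* Simplified paraphrase of skb_orphan().  Calling it from
 * .ndo_start_xmit() runs the destructor right away; for TCP-built skbs
 * that destructor is tcp_wfree(), which uncharges the skb's truesize
 * from sk->sk_wmem_alloc.  After that, the TSQ check in tcp_write_xmit()
 * no longer sees those bytes as outstanding, so it cannot throttle on
 * what sits in the device.
 */
#include <linux/skbuff.h>

static inline void skb_orphan_sketch(struct sk_buff *skb)
{
        if (skb->destructor)
                skb->destructor(skb);   /* tcp_wfree(): drop the wmem charge */
        skb->destructor = NULL;
        skb->sk = NULL;
}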
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Jason Wang @ 2013-09-10 7:45 UTC
  To: Eric Dumazet
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On 09/09/2013 09:47 PM, Eric Dumazet wrote:
> On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:
>
>> Virtio-net orphan the skb in .ndo_start_xmit() so TSQ can not throttle
>> packets in device accurately, and it also can't do BQL. Does this means
>> TSQ should be disabled for virtio-net?
>>
> If skb are orphaned, there is no way TSQ can work at all.

For example, virtio-net will stop the tx queue when it finds the tx
queue may be full, and re-enable it when some packets have been sent. In
this case, TSQ works and throttles the total bytes queued in the qdisc.
This usually happens during heavy network load, such as two sessions of
netperf.

>
> It is already disabled, so why do you want to disable it ?
>

We noticed a regression, and bisection shows it was introduced by TSQ.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-10 12:35 UTC
  To: Jason Wang
  Cc: Zoltan Kiss, Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel, Michael S. Tsirkin

On Tue, 2013-09-10 at 15:45 +0800, Jason Wang wrote:

> For example, virtio-net will stop the tx queue when it finds the tx
> queue may full and enable the queue when some packets were sent. In this
> case, tsq works and throttles the total bytes queued in qdisc. This
> usually happen during heavy network load such as two sessions of netperf.

You told me skbs were _orphaned_.

This automatically _disables_ TSQ after packets leave the Qdisc.

So you have a problem, because your skb orphaning only takes effect when
packets leave the Qdisc.

If you can't afford sockets being throttled, make sure you have no Qdisc!

> We notice a regression, and bisect shows it was introduced by TSQ.

You do realize TSQ is a balance between throughput and latency?

In the case of TSQ, it was very clear that limiting the amount of
outstanding bytes in queues could have an impact on bandwidth.

Pushing megabytes of TCP packets with identical TCP timestamps is bad,
because it prevents us from doing delay-based congestion control, and a
single flow could fill the Qdisc with a thousand packets.
(Self-induced delays; see the BufferBloat discussions.)

One known problem in the TCP stack is that sendmsg() locks the socket for
the duration of the call. sendpage() does not have this problem.

tcp_tsq_handler() is deferred if tcp_tasklet_func() finds a locked
socket. The owner of the socket will call tcp_tsq_handler() when the
socket is released.

So if you use sendmsg() with large buffers, or if copying in data from
userland involves page faults, that may explain why you need a larger
number of in-flight bytes to sustain a given throughput.

You could take a look at commit c9bee3b7fdecb0c1d070c ("tcp:
TCP_NOTSENT_LOWAT socket option"), and play with
/proc/sys/net/ipv4/tcp_notsent_lowat, to force sendmsg() to release the
socket lock every few hundred kbytes.
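A sketch of the per-socket variant of the knob Eric points at. The
TCP_NOTSENT_LOWAT option needs a kernel carrying the commit referenced
above (v3.12+); older kernels return an error, and the 128KB value here
is only an example.

/* Sketch: cap the amount of not-yet-sent data buffered per socket.
 * The system-wide equivalent is /proc/sys/net/ipv4/tcp_notsent_lowat,
 * mentioned in the message above.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25    /* value from include/uapi/linux/tcp.h */
#endif

int main(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int lowat = 128 * 1024;  /* example value */

        if (fd < 0) {
                perror("socket");
                return 1;
        }
        if (setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                       &lowat, sizeof(lowat)) < 0)
                perror("setsockopt(TCP_NOTSENT_LOWAT)");
        /* ... connect() and send as usual ... */
        close(fd);
        return 0;
}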
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-06 17:00 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:

> So I guess it would be good to revisit the default value of this
> setting.

If ixgbe requires 3 TSO packets in the TX ring to get line rate, you also
can tweak dev->gso_max_size from 65535 to 64000.
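A sketch of where such a tweak would sit in a driver. my_driver_tune_gso()
is a hypothetical hook added for illustration; netif_set_gso_max_size() is
the usual in-kernel helper behind dev->gso_max_size.

/* Sketch (kernel code): lower the advertised GSO size so the stack
 * builds slightly smaller TSO packets, letting more of them fit under a
 * given TSQ byte budget.
 */
#include <linux/netdevice.h>

static void my_driver_tune_gso(struct net_device *dev)
{
        /* down from the 65535 default mentioned above */
        netif_set_gso_max_size(dev, 64000);
}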
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-07 17:21 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>
> > So I guess it would be good to revisit the default value of this
> > setting.
>
> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> can tweak dev->gso_max_size from 65535 to 64000.

Another idea would be to no longer use tcp_limit_output_bytes but

	max(sk_pacing_rate / 1000, 2*MSS)

This means that the number of packets in FQ would be limited to the
equivalent of 1ms, so TCP could have a faster response to packet losses:
retransmitted packets would not have to wait for prior packets to be
drained from FQ.

For an 8Gbps flow (1Gbyte/s), sk_pacing_rate would be 2Gbyte/s; this
would translate to ~2 Mbytes in the Qdisc/TX ring.

sk_pacing_rate was introduced in linux-3.12, but could be backported
easily.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Zoltan Kiss @ 2013-09-09 21:41 UTC
  To: Eric Dumazet; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On 07/09/13 18:21, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
>> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>>
>>> So I guess it would be good to revisit the default value of this
>>> setting.
>>
>> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
>> can tweak dev->gso_max_size from 65535 to 64000.
>
> Another idea would be to no longer use tcp_limit_output_bytes but
>
> max(sk_pacing_rate / 1000, 2*MSS)

I've tried this on a freshly updated upstream, and it solved my problem
on ixgbe:

-	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {

Now I can get proper line rate. Btw. I've tried to decrease
dev->gso_max_size to 60K or 32K; both were ineffective.

Regards,

Zoli
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-09 21:56 UTC
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel

On Mon, 2013-09-09 at 22:41 +0100, Zoltan Kiss wrote:
> On 07/09/13 18:21, Eric Dumazet wrote:
> > On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> >> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> >>
> >>> So I guess it would be good to revisit the default value of this
> >>> setting.
> >>
> >> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> >> can tweak dev->gso_max_size from 65535 to 64000.
> >
> > Another idea would be to no longer use tcp_limit_output_bytes but
> >
> > max(sk_pacing_rate / 1000, 2*MSS)
>
> I've tried this on a freshly updated upstream, and it solved my problem
> on ixgbe:
>
> -	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> +	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {
>
> Now I can get proper line rate. Btw. I've tried to decrease
> dev->gso_max_size to 60K or 32K, both was ineffective.

Yeah, my own test was more like the following:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7c83cb8..07dc77a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1872,7 +1872,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* TSQ : sk_wmem_alloc accounts skb truesize,
 		 * including skb overhead. But thats OK.
 		 */
-		if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+		if (atomic_read(&sk->sk_wmem_alloc) >= max(2 * mss_now,
+							   sk->sk_pacing_rate >> 8)) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
 			break;
 		}

Note that it also seems to make Hystart happier.

I will send patches when all tests are green.
[parent not found: <loom.20130921T045654-573@post.gmane.org>]
[parent not found: <20130921150327.GA9078@zion.uk.xensource.com>]
* Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
  From: Cong Wang @ 2013-09-22 2:36 UTC
  To: Wei Liu; +Cc: xen-devel, Linux Kernel Network Developers, Eric Dumazet

On Sat, Sep 21, 2013 at 11:03 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Sat, Sep 21, 2013 at 03:00:26AM +0000, Cong Wang wrote:
>> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
>>
>> >
>> > Yeah, my own test was more like the following
>> >
>> ...
>> >
>> > Note that it also seems to make Hystart happier.
>> >
>> > I will send patches when all tests are green.
>> >
>>
>> How is this going? I don't see any patch posted to netdev.
>>
>
> I'm afraid you forgot to CC any relevant people in this email. :-)
>

I was replying via newsgroup, not mailing list. :)

Anyway, adding Eric and netdev now.
* Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
  From: Eric Dumazet @ 2013-09-22 14:58 UTC
  To: Cong Wang; +Cc: Wei Liu, xen-devel, Linux Kernel Network Developers

On Sun, 2013-09-22 at 10:36 +0800, Cong Wang wrote:
>
> I was replying via newsgroup, not mailing list. :)
>
> Anyway, adding Eric and netdev now.

Yes, don't worry, this will be done on Monday or Tuesday.

I am still in New Orleans after LPC 2013.
* [PATCH] tcp: TSQ can use a dynamic limit
  From: Eric Dumazet @ 2013-09-27 10:28 UTC
  To: Cong Wang, David Miller
  Cc: Wei Liu, Linux Kernel Network Developers, Yuchung Cheng, Neal Cardwell

From: Eric Dumazet <edumazet@google.com>

When TCP Small Queues was added, we used a sysctl to limit amount of
packets queues on Qdisc/device queues for a given TCP flow.

Problem is this limit is either too big for low rates, or too small
for high rates.

Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
auto sizing, it can better control number of packets in Qdisc/device
queues.

New limit is two packets or at least 1 to 2 ms worth of packets.

Low rates flows benefit from this patch by having even smaller
number of packets in queues, allowing for faster recovery,
better RTT estimations.

High rates flows benefit from this patch by allowing more than 2 packets
in flight as we had reports this was a limiting factor to reach line
rate. [ In particular if TX completion is delayed because of coalescing
parameters ]

Example for a single flow on 10Gbp link controlled by FQ/pacing

14 packets in flight instead of 2

$ tc -s -d qd
qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
 buckets 1024 quantum 3028 initial_quantum 15140
 Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476)
 rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
  2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
  2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit

Note that sk_pacing_rate is currently set to twice the actual rate, but
this might be refined in the future when a flow is in congestion
avoidance.

Additional change : skb->destructor should be set to tcp_wfree().

A future patch (for linux 3.13+) might remove tcp_limit_output_bytes

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_output.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7c83cb8..c20e406 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -895,8 +895,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 
 	skb_orphan(skb);
 	skb->sk = sk;
-	skb->destructor = (sysctl_tcp_limit_output_bytes > 0) ?
-			  tcp_wfree : sock_wfree;
+	skb->destructor = tcp_wfree;
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 
 	/* Build TCP header and checksum it. */
@@ -1840,7 +1839,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
-
 		tso_segs = tcp_init_tso_segs(sk, skb, mss_now);
 		BUG_ON(!tso_segs);
 
@@ -1869,13 +1867,20 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
-		/* TSQ : sk_wmem_alloc accounts skb truesize,
-		 * including skb overhead. But thats OK.
+		/* TCP Small Queues :
+		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
+		 * This allows for :
+		 *  - better RTT estimation and ACK scheduling
+		 *  - faster recovery
+		 *  - high rates
 		 */
-		if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+		limit = max(skb->truesize, sk->sk_pacing_rate >> 10);
+
+		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
 			break;
 		}
+
 		limit = mss_now;
 		if (tso_segs > 1 && !tcp_urg_mode(tp))
 			limit = tcp_mss_split_point(sk, skb, mss_now,
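To make the new limit concrete, a small userspace sketch of the
arithmetic; the 10Gb/s figures are example values, and sk_pacing_rate is
taken as roughly twice the actual rate, as the changelog notes.

/* Illustrative arithmetic for the new limit:
 *     limit = max(skb->truesize, sk->sk_pacing_rate >> 10)
 * With sk_pacing_rate at ~2x the actual rate, pacing_rate >> 10 is about
 * rate/512, i.e. ~2 ms of data at the actual rate -- hence "two packets
 * or at least 1 to 2 ms".  The max() with skb->truesize keeps roughly
 * two packets queueable for slow flows, since one queued skb never
 * exceeds the limit.  Example values, not measurements.
 */
#include <stdio.h>

int main(void)
{
        unsigned long rate = 1250000000UL;     /* ~10Gb/s of payload, in bytes/s */
        unsigned long pacing_rate = 2 * rate;  /* sk_pacing_rate ~ 2x actual rate */
        unsigned long limit = pacing_rate >> 10;

        printf("limit = %lu bytes (~%.2f ms at %lu B/s, ~%lu 64KB TSO skbs)\n",
               limit, 1000.0 * limit / rate, rate, limit / 65536);
        return 0;
}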
* Re: [PATCH] tcp: TSQ can use a dynamic limit
  From: Neal Cardwell @ 2013-09-27 15:08 UTC
  To: Eric Dumazet
  Cc: Cong Wang, David Miller, Wei Liu, Linux Kernel Network Developers, Yuchung Cheng

On Fri, Sep 27, 2013 at 6:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> When TCP Small Queues was added, we used a sysctl to limit amount of
> packets queues on Qdisc/device queues for a given TCP flow.
>
> Problem is this limit is either too big for low rates, or too small
> for high rates.
>
> Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
> auto sizing, it can better control number of packets in Qdisc/device
> queues.
>
> New limit is two packets or at least 1 to 2 ms worth of packets.
>
> Low rates flows benefit from this patch by having even smaller
> number of packets in queues, allowing for faster recovery,
> better RTT estimations.
>
> High rates flows benefit from this patch by allowing more than 2 packets
> in flight as we had reports this was a limiting factor to reach line
> rate. [ In particular if TX completion is delayed because of coalescing
> parameters ]
>
> Example for a single flow on 10Gbp link controlled by FQ/pacing
>
> 14 packets in flight instead of 2
>
> $ tc -s -d qd
> qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
>  buckets 1024 quantum 3028 initial_quantum 15140
>  Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476)
>  rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
>   2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
>   2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
>
> Note that sk_pacing_rate is currently set to twice the actual rate, but
> this might be refined in the future when a flow is in congestion
> avoidance.
>
> Additional change : skb->destructor should be set to tcp_wfree().
>
> A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Wei Liu <wei.liu2@citrix.com>
> Cc: Cong Wang <xiyou.wangcong@gmail.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---

Acked-by: Neal Cardwell <ncardwell@google.com>

neal
* Re: [PATCH] tcp: TSQ can use a dynamic limit
  From: Cong Wang @ 2013-09-29 15:41 UTC
  To: Eric Dumazet
  Cc: David Miller, Wei Liu, Linux Kernel Network Developers, Yuchung Cheng, Neal Cardwell

On Fri, Sep 27, 2013 at 3:28 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Additional change : skb->destructor should be set to tcp_wfree().
>
> A future patch (for linux 3.13+) might remove tcp_limit_output_bytes

Yeah, now it is only used in tcp_xmit_size_goal()...
* Re: [PATCH] tcp: TSQ can use a dynamic limit
  From: David Miller @ 2013-10-01 3:52 UTC
  To: eric.dumazet; +Cc: xiyou.wangcong, wei.liu2, netdev, ycheng, ncardwell

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 27 Sep 2013 03:28:54 -0700

> From: Eric Dumazet <edumazet@google.com>
>
> When TCP Small Queues was added, we used a sysctl to limit amount of
> packets queues on Qdisc/device queues for a given TCP flow.
>
> Problem is this limit is either too big for low rates, or too small
> for high rates.
>
> Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
> auto sizing, it can better control number of packets in Qdisc/device
> queues.
>
> New limit is two packets or at least 1 to 2 ms worth of packets.
>
> Low rates flows benefit from this patch by having even smaller
> number of packets in queues, allowing for faster recovery,
> better RTT estimations.
>
> High rates flows benefit from this patch by allowing more than 2 packets
> in flight as we had reports this was a limiting factor to reach line
> rate. [ In particular if TX completion is delayed because of coalescing
> parameters ]
>
> Example for a single flow on 10Gbp link controlled by FQ/pacing
>
> 14 packets in flight instead of 2
>
> $ tc -s -d qd
> qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
>  buckets 1024 quantum 3028 initial_quantum 15140
>  Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0 requeues 6822476)
>  rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
>   2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
>   2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
>
> Note that sk_pacing_rate is currently set to twice the actual rate, but
> this might be refined in the future when a flow is in congestion
> avoidance.
>
> Additional change : skb->destructor should be set to tcp_wfree().
>
> A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
* Re: TSQ accounting skb->truesize degrades throughput for large packets
  From: Cong Wang @ 2013-09-09 5:28 UTC
  To: netdev

On Fri, 06 Sep 2013 at 17:00 GMT, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>
>> So I guess it would be good to revisit the default value of this
>> setting.
>
> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> can tweak dev->gso_max_size from 65535 to 64000.
>

We observe a similar regression on the ixgbe driver, and also with the
virtio_net driver.