* TCP performance regression
From: Sujith Manoharan @ 2013-11-11 5:30 UTC
To: Eric Dumazet; +Cc: netdev
Hi,
The commit, "tcp: TSQ can use a dynamic limit" causes a large
performance drop in TCP transmission with the wireless driver ath9k.
With a 2-stream card (AR9462), the usual throughput is around 195 Mbps.
But, with this commit, it drops to ~125 Mbps, occasionally reaching 130.
If the commit is reverted, performance is normal again and I can get
190+ Mbps. Apparently, ath10k is also affected and a 250 Mbps drop
is seen (from an original 740 Mbps).
I am using Linville's wireless-testing tree.
From the test machine:
root@linux-test ~# uname -a
Linux linux-test 3.12.0-wl-nodebug #104 SMP PREEMPT Mon Nov 11 10:27:56 IST 2013 x86_64 GNU/Linux
root@linux-test ~# tc -d -s qdisc show dev wlan0
qdisc mq 0: root
Sent 342682272 bytes 226366 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0
root@linux-test ~# zgrep -i net_sch /proc/config.gz
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_ATM=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_MULTIQ=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFB=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_NETEM=m
CONFIG_NET_SCH_DRR=m
CONFIG_NET_SCH_MQPRIO=m
CONFIG_NET_SCH_CHOKE=m
CONFIG_NET_SCH_QFQ=m
CONFIG_NET_SCH_CODEL=m
CONFIG_NET_SCH_FQ_CODEL=m
CONFIG_NET_SCH_FQ=m
CONFIG_NET_SCH_INGRESS=m
# CONFIG_NET_SCH_PLUG is not set
CONFIG_NET_SCH_FIFO=y
If more information is required, please let me know.
Sujith
* Re: TCP performance regression
From: Eric Dumazet @ 2013-11-11 5:55 UTC
To: Sujith Manoharan; +Cc: Eric Dumazet, netdev

On Mon, 2013-11-11 at 11:00 +0530, Sujith Manoharan wrote:
> Hi,
>
> The commit, "tcp: TSQ can use a dynamic limit" causes a large
> performance drop in TCP transmission with the wireless driver ath9k.
>
> With a 2-stream card (AR9462), the usual throughput is around 195 Mbps.
> But, with this commit, it drops to ~125 Mbps, occasionally reaching 130.
>
> If the commit is reverted, performance is normal again and I can get
> 190+ Mbps. Apparently, ath10k is also affected and a 250 Mbps drop
> is seen (from an original 740 Mbps).

I am afraid this commit shows bugs in various network drivers.

All drivers doing TX completion using a timer are buggy.

Random example : drivers/net/ethernet/marvell/mvneta.c

#define MVNETA_TX_DONE_TIMER_PERIOD 10

/* Trigger tx done timer in MVNETA_TX_DONE_TIMER_PERIOD msecs */
static void mvneta_add_tx_done_timer(struct mvneta_port *pp)
{
	if (test_and_set_bit(MVNETA_F_TX_DONE_TIMER_BIT, &pp->flags) == 0) {
		pp->tx_done_timer.expires = jiffies +
			msecs_to_jiffies(MVNETA_TX_DONE_TIMER_PERIOD);
		add_timer(&pp->tx_done_timer);
	}
}

Holding skb 10 ms before TX completion is totally wrong and must be
fixed.

If really NIC is not able to trigger an interrupt after TX completion,
then driver should call skb_orphan() in its ndo_start_xmit()
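[Editor's note: a minimal sketch of what the skb_orphan() suggestion above could
look like inside a driver's ndo_start_xmit(). The example_* names and structure
are invented for illustration; only skb_orphan(), netdev_priv(), the
ndo_start_xmit() prototype and NETDEV_TX_OK are real kernel interfaces, and this
is not code from mvneta or any driver discussed in this thread.]

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical private driver state -- illustrative only. */
struct example_priv {
	int dummy;
};

/* Stub standing in for the driver's real descriptor-ring enqueue. */
static void example_queue_to_hw(struct example_priv *priv, struct sk_buff *skb)
{
	(void)priv;
	(void)skb;
}

static netdev_tx_t example_ndo_start_xmit(struct sk_buff *skb,
					  struct net_device *dev)
{
	struct example_priv *priv = netdev_priv(dev);

	/* Run the skb destructor now and detach the skb from its socket,
	 * so the queued bytes stop counting against the sender's in-flight
	 * budget even though the hardware frees the skb much later.  This
	 * restores throughput for late-completing hardware, at the cost of
	 * hiding queued bytes from the TCP stack (the trade-off discussed
	 * later in this thread). */
	skb_orphan(skb);

	example_queue_to_hw(priv, skb);
	return NETDEV_TX_OK;
}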
* Re: TCP performance regression
From: Sujith Manoharan @ 2013-11-11 6:07 UTC
To: Eric Dumazet; +Cc: netdev

Eric Dumazet wrote:
> I am afraid this commit shows bugs in various network drivers.
>
> All drivers doing TX completion using a timer are buggy.
>
> Holding skb 10 ms before TX completion is totally wrong and must be fixed.
>
> If really NIC is not able to trigger an interrupt after TX completion, then
> driver should call skb_orphan() in its ndo_start_xmit()

802.11 AMPDU formation is done in the TX completion path in ath9k.

Incoming frames are added to a software queue and the TX completion
tasklet checks if enough frames are available to form an aggregate and
if so, forms new aggregates and transmits them.

There is no timer involved, but the completion routine is rather heavy.
Many wireless drivers handle 802.11 aggregation in this way:
ath9k, ath9k_htc, ath10k etc.

Sujith
* Re: TCP performance regression
From: Eric Dumazet @ 2013-11-11 6:54 UTC
To: Sujith Manoharan; +Cc: netdev

On Mon, 2013-11-11 at 11:37 +0530, Sujith Manoharan wrote:
> Eric Dumazet wrote:
> > I am afraid this commit shows bugs in various network drivers.
> >
> > All drivers doing TX completion using a timer are buggy.
> >
> > Holding skb 10 ms before TX completion is totally wrong and must be fixed.
> >
> > If really NIC is not able to trigger an interrupt after TX completion, then
> > driver should call skb_orphan() in its ndo_start_xmit()
>
> 802.11 AMPDU formation is done in the TX completion path in ath9k.
>
> Incoming frames are added to a software queue and the TX completion
> tasklet checks if enough frames are available to form an aggregate and
> if so, forms new aggregates and transmits them.
>

Hmm... apparently ath9k uses :

#define ATH_AMPDU_LIMIT_MAX (64 * 1024 - 1)

And mentions a 4ms time frame :

	max_4ms_framelen = ATH_AMPDU_LIMIT_MAX;

So prior to "tcp: TSQ can use a dynamic limit", the ~128KB bytes TCP
could queue per TCP socket on qdisc/NIC would happen to please ath9k

ath9k can set rts_aggr_limit to 8*1024 :

	if (AR_SREV_9160_10_OR_LATER(ah) || AR_SREV_9100(ah))
		pCap->rts_aggr_limit = ATH_AMPDU_LIMIT_MAX;
	else
		pCap->rts_aggr_limit = (8 * 1024);

> There is no timer involved, but the completion routine is rather heavy.
> Many wireless drivers handle 802.11 aggregation in this way:
> ath9k, ath9k_htc, ath10k etc.

A timer would be definitely needed, and it should be rather small (1 or
2 ms)

If TCP socket is application limited, it seems ath9k can delay the last
block by a too long time.
* Re: TCP performance regression
From: Sujith Manoharan @ 2013-11-11 8:19 UTC
To: Eric Dumazet; +Cc: netdev

Eric Dumazet wrote:
> Hmm... apparently ath9k uses :
>
> #define ATH_AMPDU_LIMIT_MAX (64 * 1024 - 1)

This is the maximum AMPDU size, specified in the 802.11 standard.

> And mentions a 4ms time frame :
>
> 	max_4ms_framelen = ATH_AMPDU_LIMIT_MAX;

The 4ms limitation is a FCC limitation and is used for regulatory compliance.

> So prior to "tcp: TSQ can use a dynamic limit", the ~128KB bytes TCP
> could queue per TCP socket on qdisc/NIC would happen to please ath9k

Ok.

> ath9k can set rts_aggr_limit to 8*1024 :
>
> 	if (AR_SREV_9160_10_OR_LATER(ah) || AR_SREV_9100(ah))
> 		pCap->rts_aggr_limit = ATH_AMPDU_LIMIT_MAX;
> 	else
> 		pCap->rts_aggr_limit = (8 * 1024);

The RTS limit is required for some old chips which had HW bugs and the
above code is a workaround.

> A timer would be definitely needed, and it should be rather small (1 or
> 2 ms)
>
> If TCP socket is application limited, it seems ath9k can delay the last
> block by a too long time.

I am not really clear on how this regression can be fixed in the driver
since the majority of the transmission/aggregation logic is present in the
TX completion path.

Sujith
* Re: TCP performance regression
From: Eric Dumazet @ 2013-11-11 14:27 UTC
To: Sujith Manoharan; +Cc: netdev, Dave Taht

On Mon, 2013-11-11 at 13:49 +0530, Sujith Manoharan wrote:

> I am not really clear on how this regression can be fixed in the driver
> since the majority of the transmission/aggregation logic is present in the
> TX completion path.

We have many choices.

1) Add back a minimum of ~128 K of outstanding bytes per TCP session,
so that buggy drivers can sustain 'line rate'.

Note that with 100 concurrent TCP streams, total amount of bytes
queued on the NIC is 12 MB.
And pfifo_fast qdisc will drop packets anyway.

Thats what we call 'BufferBloat'

2) Try lower values like 64K. Still bufferbloat.

3) Fix buggy drivers, using a proper logic, or shorter timers (mvneta
case for example)

4) Add a new netdev attribute, so that well behaving NIC drivers do not
have to artificially force TCP stack to queue too many bytes in
Qdisc/NIC queues.
* Re: TCP performance regression
From: Eric Dumazet @ 2013-11-11 14:39 UTC
To: Sujith Manoharan, Arnaud Ebalard; +Cc: netdev, Dave Taht, Thomas Petazzoni

On Mon, 2013-11-11 at 06:27 -0800, Eric Dumazet wrote:
> On Mon, 2013-11-11 at 13:49 +0530, Sujith Manoharan wrote:
>
> > I am not really clear on how this regression can be fixed in the driver
> > since the majority of the transmission/aggregation logic is present in the
> > TX completion path.
>
> We have many choices.
>
> 1) Add back a minimum of ~128 K of outstanding bytes per TCP session,
> so that buggy drivers can sustain 'line rate'.
>
> Note that with 100 concurrent TCP streams, total amount of bytes
> queued on the NIC is 12 MB.
> And pfifo_fast qdisc will drop packets anyway.
>
> Thats what we call 'BufferBloat'
>
> 2) Try lower values like 64K. Still bufferbloat.
>
> 3) Fix buggy drivers, using a proper logic, or shorter timers (mvneta
> case for example)
>
> 4) Add a new netdev attribute, so that well behaving NIC drivers do not
> have to artificially force TCP stack to queue too many bytes in
> Qdisc/NIC queues.

How following patch helps mvneta performance on current net-next tree
for a single TCP (sending) flow ?

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 7d99e695a110..002ac464202f 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -172,12 +172,11 @@
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
 /* Timer */
-#define MVNETA_TX_DONE_TIMER_PERIOD	10
+#define MVNETA_TX_DONE_TIMER_PERIOD	1
 
 /* Napi polling weight */
 #define MVNETA_RX_POLL_WEIGHT		64
@@ -1592,8 +1591,7 @@ out:
 		dev_kfree_skb_any(skb);
 	}
 
-	if (txq->count >= MVNETA_TXDONE_COAL_PKTS)
-		mvneta_txq_done(pp, txq);
+	mvneta_txq_done(pp, txq);
 
 	/* If after calling mvneta_txq_done, count equals
 	 * frags, we need to set the timer
* Re: TCP performance regression
From: Eric Dumazet @ 2013-11-11 16:44 UTC
To: Sujith Manoharan; +Cc: Arnaud Ebalard, netdev, Dave Taht, Thomas Petazzoni

On Mon, 2013-11-11 at 06:39 -0800, Eric Dumazet wrote:

> How following patch helps mvneta performance on current net-next tree
> for a single TCP (sending) flow ?

v2 (more chance to even compile ;)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 7d99e695a110..e8211277f15d 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -172,12 +172,12 @@
 /* Various constants */
 
 /* Coalescing */
-#define MVNETA_TXDONE_COAL_PKTS		16
+#define MVNETA_TXDONE_COAL_PKTS		1
 #define MVNETA_RX_COAL_PKTS		32
 #define MVNETA_RX_COAL_USEC		100
 
 /* Timer */
-#define MVNETA_TX_DONE_TIMER_PERIOD	10
+#define MVNETA_TX_DONE_TIMER_PERIOD	1
 
 /* Napi polling weight */
 #define MVNETA_RX_POLL_WEIGHT		64
* RE: TCP performance regression
From: David Laight @ 2013-11-11 15:05 UTC
To: Eric Dumazet, Sujith Manoharan; +Cc: netdev, Dave Taht

> On Mon, 2013-11-11 at 13:49 +0530, Sujith Manoharan wrote:
>
> > I am not really clear on how this regression can be fixed in the driver
> > since the majority of the transmission/aggregation logic is present in the
> > TX completion path.
>
> We have many choices.
>
> 1) Add back a minimum of ~128 K of outstanding bytes per TCP session,
> so that buggy drivers can sustain 'line rate'.
>
> Note that with 100 concurrent TCP streams, total amount of bytes
> queued on the NIC is 12 MB.
> And pfifo_fast qdisc will drop packets anyway.
>
> Thats what we call 'BufferBloat'
>
> 2) Try lower values like 64K. Still bufferbloat.
>
> 3) Fix buggy drivers, using a proper logic, or shorter timers (mvneta
> case for example)
>
> 4) Add a new netdev attribute, so that well behaving NIC drivers do not
> have to artificially force TCP stack to queue too many bytes in
> Qdisc/NIC queues.

Or, maybe:
5) call skb_orphan() (I think that is the correct function) when transmit
packets are given to the hardware.

I think that if the mac driver supports BQL this could be done as soon
as the BQL resource is assigned to the packet.
I suspect this could be done unconditionally.

Clearly the skb may also need to be freed to allow protocol
retransmissions to complete properly - but that won't be so timing
critical.

I remember (a long time ago) getting a measurable performance increase
by disabling the 'end of transmit' interrupt and only doing tx tidyup
when the driver was active for other reasons.
There were 2 reasons for enabling the interrupt:
1) tx ring full.
2) tx buffer had a user-defined delete function.

	David
* RE: TCP performance regression
From: Eric Dumazet @ 2013-11-11 15:29 UTC
To: David Laight; +Cc: Sujith Manoharan, netdev, Dave Taht

On Mon, 2013-11-11 at 15:05 +0000, David Laight wrote:

> Or, maybe:
> 5) call skb_orphan() (I think that is the correct function) when transmit
> packets are given to the hardware.

This is the worth possible solution, as it basically re-enables
bufferbloat again.

socket sk_wmem_queued should not be fooled, unless we have no other
choice.
* RE: TCP performance regression
From: David Laight @ 2013-11-11 15:43 UTC
To: Eric Dumazet; +Cc: Sujith Manoharan, netdev, Dave Taht

> > Or, maybe:
> > 5) call skb_orphan() (I think that is the correct function) when transmit
> > packets are given to the hardware.
>
> This is the worth possible solution, as it basically re-enables
              ^^^^^ worst ?
> bufferbloat again.

It should be ok if the mac driver only gives the hardware a small
number of bytes/packets - or one appropriate for the link speed.

	David
* RE: TCP performance regression
From: Eric Dumazet @ 2013-11-11 16:17 UTC
To: David Laight; +Cc: Sujith Manoharan, netdev, Dave Taht

On Mon, 2013-11-11 at 15:43 +0000, David Laight wrote:
> > > Or, maybe:
> > > 5) call skb_orphan() (I think that is the correct function) when transmit
> > > packets are given to the hardware.
> >
> > This is the worth possible solution, as it basically re-enables
>               ^^^^^ worst ?
> > bufferbloat again.
>
> It should be ok if the mac driver only gives the hardware a small
> number of bytes/packets - or one appropriate for the link speed.

There is some confusion here.

mvneta has a TX ring buffer, which can hold up to 532 TX descriptors.

If this driver used skb_orphan(), a single TCP flow could use the whole
TX ring.

TCP Small Queue would only limit the number of skbs on Qdisc.

Try then to send a ping message, it will have to wait a lot.
* RE: TCP performance regression
From: David Laight @ 2013-11-11 16:35 UTC
To: Eric Dumazet; +Cc: Sujith Manoharan, netdev, Dave Taht

> > It should be ok if the mac driver only gives the hardware a small
> > number of bytes/packets - or one appropriate for the link speed.
>
> There is some confusion here.
>
> mvneta has a TX ring buffer, which can hold up to 532 TX descriptors.
>
> If this driver used skb_orphan(), a single TCP flow could use the whole
> TX ring.
>
> TCP Small Queue would only limit the number of skbs on Qdisc.
>
> Try then to send a ping message, it will have to wait a lot.

532 is a ridiculously large number especially for a slow interface.
At a guess you don't want more than 10-20ms of data in the tx ring.
You might need extra descriptors for badly fragmented packets.

	David
* RE: TCP performance regression
From: Eric Dumazet @ 2013-11-11 17:41 UTC
To: David Laight; +Cc: Sujith Manoharan, netdev, Dave Taht

On Mon, 2013-11-11 at 16:35 +0000, David Laight wrote:
> > > It should be ok if the mac driver only gives the hardware a small
> > > number of bytes/packets - or one appropriate for the link speed.
> >
> > There is some confusion here.
> >
> > mvneta has a TX ring buffer, which can hold up to 532 TX descriptors.
> >
> > If this driver used skb_orphan(), a single TCP flow could use the whole
> > TX ring.
> >
> > TCP Small Queue would only limit the number of skbs on Qdisc.
> >
> > Try then to send a ping message, it will have to wait a lot.
>
> 532 is a ridiculously large number especially for a slow interface.
> At a guess you don't want more than 10-20ms of data in the tx ring.
> You might need extra descriptors for badly fragmented packets.

Thats why we invented BQL.

Problem is most driver authors don't care of the problem.
They already have hard time to make bug free drivers.

BQL is adding pressure and expose long standing bugs.

Some drivers have large TX rings to lower race probabilities.
* Re: TCP performance regression
From: Willy Tarreau @ 2013-11-12 7:42 UTC
To: David Laight; +Cc: Eric Dumazet, Sujith Manoharan, netdev, Dave Taht

On Mon, Nov 11, 2013 at 04:35:30PM -0000, David Laight wrote:
> > > It should be ok if the mac driver only gives the hardware a small
> > > number of bytes/packets - or one appropriate for the link speed.
> >
> > There is some confusion here.
> >
> > mvneta has a TX ring buffer, which can hold up to 532 TX descriptors.
> >
> > If this driver used skb_orphan(), a single TCP flow could use the whole
> > TX ring.
> >
> > TCP Small Queue would only limit the number of skbs on Qdisc.
> >
> > Try then to send a ping message, it will have to wait a lot.
>
> 532 is a ridiculously large number especially for a slow interface.
> At a guess you don't want more than 10-20ms of data in the tx ring.

Well, it's not *that* large, 532 descriptors is 800 kB or 6.4 ms with
1500-bytes packets, and 273 microseconds for 64-byte packets. In fact
it's not a slow interface, it's the systems it runs on which are
generally not that fast. For example it is possible to saturate two
gig ports at once on a single-core Armada370. But you need buffers
large enough to compensate for the context switch time if you use
multiple threads to send.

Regards,
Willy
* Re: TCP performance regression
From: Eric Dumazet @ 2013-11-12 14:16 UTC
To: Willy Tarreau; +Cc: David Laight, Sujith Manoharan, netdev, Dave Taht

On Tue, 2013-11-12 at 08:42 +0100, Willy Tarreau wrote:

> Well, it's not *that* large, 532 descriptors is 800 kB or 6.4 ms with
> 1500-bytes packets, and 273 microseconds for 64-byte packets. In fact
> it's not a slow interface, it's the systems it runs on which are
> generally not that fast. For example it is possible to saturate two
> gig ports at once on a single-core Armada370. But you need buffers
> large enough to compensate for the context switch time if you use
> multiple threads to send.

With GSO, each 1500-bytes packet might need 2 descriptors anyway (one
for the headers, one for the payload), so 532 descriptors only hold
400 kB or 3.2 ms ;)

If the NIC was supporting TSO, this would be another story, as a 64KB
packet could use only 3 descriptors.
* Re: TCP performance regression
From: Dave Taht @ 2013-11-14 9:54 UTC
To: Willy Tarreau
Cc: David Laight, Eric Dumazet, Sujith Manoharan, netdev@vger.kernel.org

On Mon, Nov 11, 2013 at 11:42 PM, Willy Tarreau <w@1wt.eu> wrote:
> On Mon, Nov 11, 2013 at 04:35:30PM -0000, David Laight wrote:
>> > > It should be ok if the mac driver only gives the hardware a small
>> > > number of bytes/packets - or one appropriate for the link speed.
>> >
>> > There is some confusion here.
>> >
>> > mvneta has a TX ring buffer, which can hold up to 532 TX descriptors.
>> >
>> > If this driver used skb_orphan(), a single TCP flow could use the whole
>> > TX ring.
>> >
>> > TCP Small Queue would only limit the number of skbs on Qdisc.
>> >
>> > Try then to send a ping message, it will have to wait a lot.
>>
>> 532 is a ridiculously large number especially for a slow interface.
>> At a guess you don't want more than 10-20ms of data in the tx ring.
>
> Well, it's not *that* large, 532 descriptors is 800 kB or 6.4 ms with
> 1500-bytes packets, and 273 microseconds for 64-byte packets. In fact
> it's not a slow interface, it's the systems it runs on which are
> generally not that fast. For example it is possible to saturate two
> gig ports at once on a single-core Armada370. But you need buffers
> large enough to compensate for the context switch time if you use
> multiple threads to send.

There is this terrible tendency to think that all interfaces run at
maximum rate, always.

There has been an interesting trend towards slower rates of late -
Things like the pi and beaglebone black have interfaces that run at
100Mbits (cheaper phy, less power), and thus communication from a
armada370 in this case, at line rate, would induce up to 64ms of
delay. A 10Mbit interface, 640ms. Many devices that connect to the
internet run at these lower speeds.

BQL can hold that down to something reasonable at a wide range of
line rates on ethernet.

In the context of 802.11 wireless, the rate problem is much, much
worse, going down to 1Mbit, and never getting as high as a gig, and
often massively extending things with exorbitant retries and
retransmits.

Although Eric fixed "the regression" on the new fq stuff vs a vs the
ath10k and ath9k, I would really have liked a set of benchmarks of the
ath10k and ath9k device and driver at realistic rates like MCS1 and
MCS4, to make more clear the problems those devices have at real
world, rather than lab, transmission rates.

>
> Regards,
> Willy
>

--
Dave Täht

Fixing bufferbloat with cerowrt:
http://www.teklibre.com/cerowrt/subscribe.html
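[Editor's note: the ring-drain figures traded in the last few messages are easy
to re-derive. The short standalone program below is an editorial illustration,
not part of the thread; it reproduces Willy's 6.4 ms / 273 us numbers, Eric's
3.2 ms GSO case, and Dave's 64 ms and 640 ms figures for slower links.]

#include <stdio.h>

/* Time (in milliseconds) to drain a full TX ring at a given line rate,
 * assuming every descriptor carries 'bytes_per_desc' bytes of payload. */
static double drain_ms(int descriptors, double bytes_per_desc, double rate_bps)
{
	return descriptors * bytes_per_desc * 8.0 / rate_bps * 1000.0;
}

int main(void)
{
	/* Willy: 532 descriptors of 1500-byte packets at 1 Gbit/s -> ~6.4 ms,
	 * and 64-byte packets -> ~0.27 ms. */
	printf("1500B @ 1G   : %.1f ms\n", drain_ms(532, 1500, 1e9));
	printf("  64B @ 1G   : %.2f ms\n", drain_ms(532, 64, 1e9));

	/* Eric: with GSO each 1500-byte packet may need 2 descriptors, so the
	 * ring effectively holds half the payload -> ~3.2 ms. */
	printf("GSO   @ 1G   : %.1f ms\n", drain_ms(532 / 2, 1500, 1e9));

	/* Dave: the same ring in front of a 100 Mbit PHY -> ~64 ms, and
	 * ~640 ms at 10 Mbit. */
	printf("1500B @ 100M : %.0f ms\n", drain_ms(532, 1500, 100e6));
	printf("1500B @ 10M  : %.0f ms\n", drain_ms(532, 1500, 10e6));
	return 0;
}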
* Re: TCP performance regression
From: Sujith Manoharan @ 2013-11-11 16:13 UTC
To: Eric Dumazet; +Cc: Felix Fietkau, netdev, Dave Taht

Eric Dumazet wrote:
> We have many choices.
>
> 1) Add back a minimum of ~128 K of outstanding bytes per TCP session,
> so that buggy drivers can sustain 'line rate'.
>
> Note that with 100 concurrent TCP streams, total amount of bytes
> queued on the NIC is 12 MB.
> And pfifo_fast qdisc will drop packets anyway.
>
> Thats what we call 'BufferBloat'
>
> 2) Try lower values like 64K. Still bufferbloat.
>
> 3) Fix buggy drivers, using a proper logic, or shorter timers (mvneta
> case for example)
>
> 4) Add a new netdev attribute, so that well behaving NIC drivers do not
> have to artificially force TCP stack to queue too many bytes in
> Qdisc/NIC queues.

I think the quirks of 802.11 aggregation should be taken into account.
I am adding Felix to this thread, who would have more to say on
latency/bufferbloat with wireless drivers.

Sujith
* Re: TCP performance regression
From: Felix Fietkau @ 2013-11-11 16:38 UTC
To: Sujith Manoharan, Eric Dumazet; +Cc: netdev, Dave Taht

On 2013-11-11 17:13, Sujith Manoharan wrote:
> Eric Dumazet wrote:
>> We have many choices.
>>
>> 1) Add back a minimum of ~128 K of outstanding bytes per TCP session,
>> so that buggy drivers can sustain 'line rate'.
>>
>> Note that with 100 concurrent TCP streams, total amount of bytes
>> queued on the NIC is 12 MB.
>> And pfifo_fast qdisc will drop packets anyway.
>>
>> Thats what we call 'BufferBloat'
>>
>> 2) Try lower values like 64K. Still bufferbloat.
>>
>> 3) Fix buggy drivers, using a proper logic, or shorter timers (mvneta
>> case for example)
>>
>> 4) Add a new netdev attribute, so that well behaving NIC drivers do not
>> have to artificially force TCP stack to queue too many bytes in
>> Qdisc/NIC queues.
>
> I think the quirks of 802.11 aggregation should be taken into account.
> I am adding Felix to this thread, who would have more to say on
> latency/bufferbloat with wireless drivers.

I don't think this issue is about something as simple as timer handling
for tx completion (or even broken/buggy drivers).

There's simply no way to make 802.11 aggregation work well and have
similar tx completion latency characteristics as Ethernet devices.

802.11 aggregation reduces the per-packet airtime overhead by combining
multiple packets into one transmission (saving a lot of time getting a
tx opportunity, transmitting the PHY header, etc.), which makes the
'line rate' heavily depend on the amount of buffering.

Aggregating multiple packets into one transmission also causes extra
packet loss, which is compensated by retransmission and reordering, thus
introducing additional latency.

I don't think that TSQ can do a decent job of mitigating bufferbloat on
802.11n devices without a significant performance hit, so adding a new
netdev attribute might be a good idea.

- Felix
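[Editor's note: a rough back-of-the-envelope model, with assumed numbers rather
than measurements from this thread, shows why the achievable rate Felix mentions
depends so heavily on buffering. If each transmit opportunity costs a fixed
overhead T_ov (contention, PHY preamble, block-ack exchange) and carries an
A-MPDU of N subframes of L bytes at PHY rate R, then approximately

    \text{goodput} \approx \frac{8NL}{T_{ov} + 8NL/R}

With assumed values T_ov = 200 us, L = 1500 bytes and R = 300 Mbit/s, one frame
per transmit opportunity yields about 50 Mbit/s, while a 32-subframe aggregate
yields about 260 Mbit/s; starving the aggregation queue therefore costs most of
the throughput even though the PHY rate is unchanged.]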
* Re: TCP performance regression
From: Eric Dumazet @ 2013-11-11 17:38 UTC
To: Felix Fietkau; +Cc: Sujith Manoharan, netdev, Dave Taht

On Mon, 2013-11-11 at 17:38 +0100, Felix Fietkau wrote:
> On 2013-11-11 17:13, Sujith Manoharan wrote:
> > Eric Dumazet wrote:
> >> We have many choices.
> >>
> >> 1) Add back a minimum of ~128 K of outstanding bytes per TCP session,
> >> so that buggy drivers can sustain 'line rate'.
> >>
> >> Note that with 100 concurrent TCP streams, total amount of bytes
> >> queued on the NIC is 12 MB.
> >> And pfifo_fast qdisc will drop packets anyway.
> >>
> >> Thats what we call 'BufferBloat'
> >>
> >> 2) Try lower values like 64K. Still bufferbloat.
> >>
> >> 3) Fix buggy drivers, using a proper logic, or shorter timers (mvneta
> >> case for example)
> >>
> >> 4) Add a new netdev attribute, so that well behaving NIC drivers do not
> >> have to artificially force TCP stack to queue too many bytes in
> >> Qdisc/NIC queues.
> >
> > I think the quirks of 802.11 aggregation should be taken into account.
> > I am adding Felix to this thread, who would have more to say on
> > latency/bufferbloat with wireless drivers.
>
> I don't think this issue is about something as simple as timer handling
> for tx completion (or even broken/buggy drivers).
>
> There's simply no way to make 802.11 aggregation work well and have
> similar tx completion latency characteristics as Ethernet devices.
>
> 802.11 aggregation reduces the per-packet airtime overhead by combining
> multiple packets into one transmission (saving a lot of time getting a
> tx opportunity, transmitting the PHY header, etc.), which makes the
> 'line rate' heavily depend on the amount of buffering.

How long a TX packet is put on hold hoping a following packet will
come ?

> Aggregating multiple packets into one transmission also causes extra
> packet loss, which is compensated by retransmission and reordering, thus
> introducing additional latency.
>
> I don't think that TSQ can do a decent job of mitigating bufferbloat on
> 802.11n devices without a significant performance hit, so adding a new
> netdev attribute might be a good idea.

The netdev attribute would work, but might not work well if using a
tunnel...
* Re: TCP performance regression
From: Felix Fietkau @ 2013-11-11 17:44 UTC
To: Eric Dumazet; +Cc: Sujith Manoharan, netdev, Dave Taht

On 2013-11-11 18:38, Eric Dumazet wrote:
> On Mon, 2013-11-11 at 17:38 +0100, Felix Fietkau wrote:
>> I don't think this issue is about something as simple as timer handling
>> for tx completion (or even broken/buggy drivers).
>>
>> There's simply no way to make 802.11 aggregation work well and have
>> similar tx completion latency characteristics as Ethernet devices.
>>
>> 802.11 aggregation reduces the per-packet airtime overhead by combining
>> multiple packets into one transmission (saving a lot of time getting a
>> tx opportunity, transmitting the PHY header, etc.), which makes the
>> 'line rate' heavily depend on the amount of buffering.
>
> How long a TX packet is put on hold hoping a following packet will
> come ?

TX packets in the aggregation queue are held as long as the hardware
queue holds two A-MPDUs (each of which can contain up to 32 packets).
If the aggregation queues are empty and the hardware queue is not full,
the next tx packet from the network stack is pushed to the hardware
queue immediately.

- Felix
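[Editor's note: in pseudo-C, the rule Felix describes looks roughly like the
sketch below. The names (example_txq, aggr_queue, hw_pending_ampdus,
example_push_to_hw) are invented for illustration and are not the actual
ath9k/mac80211 symbols; only the sk_buff queue helpers are real kernel APIs.]

#include <linux/skbuff.h>

/* Hypothetical per-queue state, loosely mirroring the description above. */
struct example_txq {
	struct sk_buff_head aggr_queue;	/* software aggregation queue */
	int hw_pending_ampdus;		/* A-MPDUs currently owned by hardware */
};

/* Stub for handing a frame (or a freshly built aggregate) to the hardware. */
static void example_push_to_hw(struct example_txq *txq, struct sk_buff *skb);

static void example_tx_enqueue(struct example_txq *txq, struct sk_buff *skb)
{
	if (skb_queue_empty(&txq->aggr_queue) && txq->hw_pending_ampdus < 2) {
		/* Hardware is not yet holding two A-MPDUs and nothing is
		 * waiting to be aggregated: push immediately, so latency
		 * stays low when the link is idle. */
		example_push_to_hw(txq, skb);
	} else {
		/* Otherwise hold the frame; the TX completion path will
		 * fold it into the next aggregate (up to 32 subframes). */
		__skb_queue_tail(&txq->aggr_queue, skb);
	}
}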
* Re: TCP performance regression
From: Dave Taht @ 2013-11-11 18:03 UTC
To: Eric Dumazet
Cc: Felix Fietkau, Sujith Manoharan, netdev@vger.kernel.org

On Mon, Nov 11, 2013 at 9:38 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2013-11-11 at 17:38 +0100, Felix Fietkau wrote:
>> On 2013-11-11 17:13, Sujith Manoharan wrote:
>> > Eric Dumazet wrote:
>> >> We have many choices.
>> >>
>> >> 1) Add back a minimum of ~128 K of outstanding bytes per TCP session,
>> >> so that buggy drivers can sustain 'line rate'.
>> >>
>> >> Note that with 100 concurrent TCP streams, total amount of bytes
>> >> queued on the NIC is 12 MB.
>> >> And pfifo_fast qdisc will drop packets anyway.
>> >>
>> >> Thats what we call 'BufferBloat'
>> >>
>> >> 2) Try lower values like 64K. Still bufferbloat.
>> >>
>> >> 3) Fix buggy drivers, using a proper logic, or shorter timers (mvneta
>> >> case for example)
>> >>
>> >> 4) Add a new netdev attribute, so that well behaving NIC drivers do not
>> >> have to artificially force TCP stack to queue too many bytes in
>> >> Qdisc/NIC queues.
>> >
>> > I think the quirks of 802.11 aggregation should be taken into account.
>> > I am adding Felix to this thread, who would have more to say on
>> > latency/bufferbloat with wireless drivers.

As I just got dropped in the middle of this convo, I tend to think
that the mac80211 questions is should be handled in it's own thread as
this conversation seemed to be about a certain ethernet driver's
flaws.

>> I don't think this issue is about something as simple as timer handling
>> for tx completion (or even broken/buggy drivers).
>>
>> There's simply no way to make 802.11 aggregation work well and have
>> similar tx completion latency characteristics as Ethernet devices.

I don't quite share all of felix's pessimism. It will tend to be
burstier, yes, but I felt that would not look that much different than
napi -> BQL.

>> 802.11 aggregation reduces the per-packet airtime overhead by combining
>> multiple packets into one transmission (saving a lot of time getting a
>> tx opportunity, transmitting the PHY header, etc.), which makes the
>> 'line rate' heavily depend on the amount of buffering.

making aggregation work well is key to fixing wifi worldwide.
Presently aggregation performance is pretty universally terrible under
real loads and tcp.

(looking further ahead, getting multi-user mimo to work in 802.11ac
would also be helpful but I'm not even sure the IEEE figured that out
yet. Ath10k hw2 do it?)

> How long a TX packet is put on hold hoping a following packet will
> come ?

>> Aggregating multiple packets into one transmission also causes extra
>> packet loss, which is compensated by retransmission and reordering, thus
>> introducing additional latency.

I was extremely encouraged by Yucheng's presentation at ietf on some
vast improvements on managing re-ordering problems. I daydreamed that
it would become possible to eliminate the reorder buffer in lower
levels of the wireless stack(s?). See slides and fantasize:

http://www.ietf.org/proceedings/88/slides/slides-88-iccrg-6.pdf

The rest of the preso was good, too. I also thought the new pacing
stuff would cause trouble in wifi and aggregation.

>> I don't think that TSQ can do a decent job of mitigating bufferbloat on
>> 802.11n devices without a significant performance hit, so adding a new
>> netdev attribute might be a good idea.

I am not sure which part of what subsystem(s) is really under debate
here. TSQ limits the number of packets that can be outstanding in a
stream. The characteristics of a wifi connection (EDCA scheduling and
aggregated batching) play merry hell with TCP assumptions. The recent
work on fixing TSO offloads shows what can happen if that underlying
set of assumptions is fixed.

My overall take on this, tho, is to take the latest bits of TSQ and
"fq" code, and go measure the effect on wifi stations rather than
discuss what layer is busted or what options need to be added to
netdev. Has anyone done that? I've been busy with 3.10.x

Personally I don't have much of a problem if TSQ hurts single stream
TCP throughput on wifi. I would vastly prefer aggregation to work
better for multiple streams with vastly smaller buffers than it does.
That would be a bigger win, overall.

> The netdev attribute would work, but might not work well if using a
> tunnel...

I am going to make some coffee and catch up. Please excuse whatever
noise I just introduced.

--
Dave Täht

Fixing bufferbloat with cerowrt:
http://www.teklibre.com/cerowrt/subscribe.html
* Re: TCP performance regression
From: Sujith Manoharan @ 2013-11-11 18:29 UTC
To: Dave Taht; +Cc: Eric Dumazet, Felix Fietkau, netdev@vger.kernel.org

Dave Taht wrote:
> Personally I don't have much of a problem if TSQ hurts single stream
> TCP throughput on wifi. I would vastly prefer aggregation to work
> better for multiple streams with vastly smaller buffers than it does.
> That would be a bigger win, overall.

ath9k doesn't hold very deep queues for aggregated traffic. A maximum
of 128 packets can be buffered for each Access Class queue and still
good throughput is obtained, even for 3x3 scenarios.

A loss of almost 50% throughput is seen in 1x1 setups and the penalty
becomes higher with more streams. I don't think such a big loss in
performance is acceptable to achieve low latency.

Sujith
* Re: TCP performance regression
From: Dave Taht @ 2013-11-11 18:31 UTC
To: Eric Dumazet
Cc: Felix Fietkau, Sujith Manoharan, netdev@vger.kernel.org, Avery Pennarun

Ah, this thread started with a huge regression in ath10k performance
with the new TSQ stuff, and isn't actually about a two line fix to the
mv ethernet driver.

http://comments.gmane.org/gmane.linux.network/290269

I suddenly care a lot more. And I'll care a lot, lot, lot more, if
someone can post a rrul test for before and after the new fq scheduler
and tsq change on this driver on this hardware... What, if anything,
in terms of improvements or regressions, happened to multi-stream
throughput and latency?

https://github.com/tohojo/netperf-wrapper
* Re: TCP performance regression
From: Ben Greear @ 2013-11-11 19:11 UTC
To: Dave Taht
Cc: Eric Dumazet, Felix Fietkau, Sujith Manoharan, netdev@vger.kernel.org, Avery Pennarun

On 11/11/2013 10:31 AM, Dave Taht wrote:
> Ah, this thread started with a huge regression in ath10k performance
> with the new TSQ stuff, and isn't actually about a two line fix to the
> mv ethernet driver.
>
> http://comments.gmane.org/gmane.linux.network/290269
>
> I suddenly care a lot more. And I'll care a lot, lot, lot more, if
> someone can post a rrul test for before and after the new fq scheduler
> and tsq change on this driver on this hardware... What, if anything,
> in terms of improvements or regressions, happened to multi-stream
> throughput and latency?
>
> https://github.com/tohojo/netperf-wrapper

Not directly related, but we have run some automated tests against
an older buffer-bloat enabled AP (not ath10k hardware, don't know the
exact details at the moment), and in general the performance
is horrible compared to all of the other APs we test against.

Our tests are concerned mostly with throughput.

For reference, here are some graphs with supplicant/hostapd
running on higher-end x86-64 hardware and ath9k:

http://www.candelatech.com/lf_wifi_examples.php

We see somewhat similar results with most commercial APs, though
often they max out at 128 or fewer stations instead of the several
hundred we get on our own AP configs.

We'll update to more recent buffer-bloat AP software and post some
results when we get a chance.

Thanks,
Ben

--
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com
* Re: TCP performance regression
From: Dave Taht @ 2013-11-11 19:24 UTC
To: Ben Greear
Cc: Eric Dumazet, Felix Fietkau, Sujith Manoharan, netdev@vger.kernel.org, Avery Pennarun

On Mon, Nov 11, 2013 at 11:11 AM, Ben Greear <greearb@candelatech.com> wrote:
> On 11/11/2013 10:31 AM, Dave Taht wrote:
>> Ah, this thread started with a huge regression in ath10k performance
>> with the new TSQ stuff, and isn't actually about a two line fix to the
>> mv ethernet driver.
>>
>> http://comments.gmane.org/gmane.linux.network/290269
>>
>> I suddenly care a lot more. And I'll care a lot, lot, lot more, if
>> someone can post a rrul test for before and after the new fq scheduler
>> and tsq change on this driver on this hardware... What, if anything,
>> in terms of improvements or regressions, happened to multi-stream
>> throughput and latency?
>>
>> https://github.com/tohojo/netperf-wrapper
>
> Not directly related, but we have run some automated tests against
> an older buffer-bloat enabled AP (not ath10k hardware, don't know the
> exact details at the moment), and in general the performance
> is horrible compared to all of the other APs we test against.

I was not happy with the dlink product and the streamboost
implementation, if that is what it was.

> Our tests are concerned mostly with throughput.

:(

> For reference, here are some graphs with supplicant/hostapd
> running on higher-end x86-64 hardware and ath9k:
>
> http://www.candelatech.com/lf_wifi_examples.php
>
> We see somewhat similar results with most commercial APs, though
> often they max out at 128 or fewer stations instead of the several
> hundred we get on our own AP configs.
>
> We'll update to more recent buffer-bloat AP software and post some
> results when we get a chance.

Are you talking cerowrt (on the wndr3800) here?

I am well aware that it doesn't presently scale well with large
numbers of clients, which is awaiting the per-sta queue work. (most of
the work to date has been on the aqm-to-the-universe code)

This is the most recent stable firmware for that:

http://snapon.lab.bufferbloat.net/~cero2/cerowrt/wndr/3.10.17-6/

I just did 3.10.18 but haven't tested it.

Cero also runs HT20 by default, and there are numerous other things
that are configured more for "science" than throughput. Notably the
size of the aggregation queues is limited.

But I'd LOVE a test through your suite.

I note I'd also love to see TCP tests through your suite with the AP
configured thusly

(server) - (100ms delay box running a recent netem and a packet limit
of 100000+) - AP (w 1000 packets buffering/wo AQM, and with AQM) -
(wifi clients)

(and will gladly help set that up. Darn, I just drove past your
offices)

--
Dave Täht