From: george.dunlap@eu.citrix.com (George Dunlap)
To: linux-arm-kernel@lists.infradead.org
Subject: [Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen
Date: Wed, 15 Apr 2015 18:23:25 +0100
Message-ID: <552E9E8D.1080000@eu.citrix.com>
In-Reply-To: <1429115934.7346.107.camel@edumazet-glaptop2.roam.corp.google.com>

On 04/15/2015 05:38 PM, Eric Dumazet wrote:
> My thoughts that instead of these long talks you should guys read the
> code :
> 
>                 /* TCP Small Queues :
>                  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>                  * This allows for :
>                  *  - better RTT estimation and ACK scheduling
>                  *  - faster recovery
>                  *  - high rates
>                  * Alas, some drivers / subsystems require a fair amount
>                  * of queued bytes to ensure line rate.
>                  * One example is wifi aggregation (802.11 AMPDU)
>                  */
>                 limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
> 
> 
> Then you'll see that most of your questions are already answered.
> 
> Feel free to try to improve the behavior, if it does not hurt critical workloads
> like TCP_RR, where we we send very small messages, millions times per second.

First of all, with regard to critical workloads, once this patch gets
into distros, *normal TCP streams* on every VM running on Amazon,
Rackspace, Linode, &c will get a 30% hit in performance *by default*.
Normal TCP streams on xennet *are* a critical workload, and deserve the
same kind of accommodation as TCP_RR (if not more).  The same goes for
virtio_net.

Secondly, according to Stefano's and Jonathan's tests, raising
tcp_limit_output_bytes completely fixes the problem for Xen.

Which means that max(2 * skb->truesize, sk->sk_pacing_rate >> 10) is
*already* larger than the sysctl cap for Xen; the calculation described
in the comment is *already* doing the right thing.

As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an
automatic TSQ calculation which is actually choosing an effective value
for xennet.

It certainly makes sense for sysctl_tcp_limit_output_bytes to be an
actual maximum limit.  I went back and looked at the original patch
which introduced it (46d3ceabd), and it looks to me like it was designed
to be a rough, quick estimate of "two packets outstanding" (by choosing
the maximum size of the packet, 64k, and multiplying it by two).

Now that you have a better algorithm -- the size of 2 actual packets or
the amount transmitted in 1ms -- it seems like the default
sysctl_tcp_limit_output_bytes should be higher, so that the automatic
TSQ calculation on the first line can throttle things down when necessary.
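
To make the interaction concrete, here is a rough userspace model of the
two quoted lines. It is only a sketch, not the kernel code: tsq_limit()
and its parameter names are hypothetical, and the truesize and pacing-rate
figures mentioned afterwards are illustrative assumptions, not
measurements from xennet.

```c
#include <stdint.h>

/*
 * Userspace model of the two quoted lines (hypothetical helper, not a
 * kernel function):
 *
 *   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
 *   limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 *
 * truesize            stands in for skb->truesize, in bytes
 * pacing_rate         stands in for sk->sk_pacing_rate, in bytes/second
 * limit_output_bytes  stands in for sysctl_tcp_limit_output_bytes
 */
static uint32_t tsq_limit(uint32_t truesize, uint32_t pacing_rate,
                          uint32_t limit_output_bytes)
{
    uint32_t two_skbs = 2 * truesize;     /* "two packets" */
    uint32_t one_ms = pacing_rate >> 10;  /* bytes sent in roughly 1 ms */
    uint32_t limit = two_skbs > one_ms ? two_skbs : one_ms;

    /* The sysctl acts as a hard cap on the automatic value. */
    return limit < limit_output_bytes ? limit : limit_output_bytes;
}
```

Assuming a flow paced at 1250000000 bytes/s (about 10 Gbit/s) with a
70000-byte truesize, the automatic value is 1220703 bytes, but a cap of
131072 (2 x 64k) clamps it to 131072. With the cap raised to, say,
4194304, the same flow gets 1220703, while a slow flow (1000000 bytes/s,
2048-byte truesize) still gets only 4096 under either cap.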

 -George
