[PATCH net-next] tcp: TSO packets automatic sizing

* [PATCH net-next] tcp: TSO packets automatic sizing
@ 2013-08-24  0:29 Eric Dumazet
  2013-08-24  3:17 ` Neal Cardwell
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-24  0:29 UTC
  To: David Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Van Jacobson, Tom Herbert

From: Eric Dumazet <edumazet@google.com>

After hearing many people over the past years complain about TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and the general consensus is to reduce the
amount of buffering.

This patch introduces a per-socket sk_pacing_rate that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

The patch has no impact on high-speed flows, where having large TSO
packets makes sense to reach line rate.

For other flows, this helps packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt
 
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
---
Google-Bug-Id: 8662219
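
The sizing rule described in the commit message can be illustrated with a
small user-space model. This is only a sketch, not the kernel code:
tso_size_goal() and its parameters are hypothetical names, the
gso_max_payload argument stands in for sk_gso_max_size - 1 - hlen, and the
later clamp against sysctl_tcp_limit_output_bytes >> 1 is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MSEC_PER_SEC 1000u

/* Hypothetical stand-alone model of the sizing rule in tcp_xmit_size_goal().
 * pacing_rate is in bytes per second and already holds twice the estimated
 * rate, hence the division by 2 * MSEC_PER_SEC to get about 1 ms of data.
 */
static uint32_t tso_size_goal(uint32_t pacing_rate, uint32_t mss,
			      uint32_t min_tso_segs, uint32_t gso_max_payload)
{
	uint32_t goal = pacing_rate / (2 * MSEC_PER_SEC);
	uint32_t floor = min_tso_segs * mss;

	assert(mss > 0);

	/* Never go below tcp_min_tso_segs full-sized segments. */
	if (goal < floor)
		goal = floor;
	/* Never exceed what the device accepts in one GSO packet. */
	if (goal > gso_max_payload)
		goal = gso_max_payload;
	return goal;
}
```

With an MSS of 1448 bytes, a flow whose sk_pacing_rate is 2,500,000 bytes
per second falls back to the 2-segment floor (2896 bytes), while a flow at
2,500,000,000 bytes per second is limited only by the device maximum.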

 Documentation/networking/ip-sysctl.txt |    9 +++++++
 include/net/sock.h                     |    2 +
 include/net/tcp.h                      |    1 
 net/ipv4/sysctl_net_ipv4.c             |   10 ++++++++
 net/ipv4/tcp.c                         |   28 ++++++++++++++++++-----
 net/ipv4/tcp_input.c                   |   17 +++++++++++++
 6 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index debfe85..ce5bb43 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -482,6 +482,15 @@ tcp_syn_retries - INTEGER
 tcp_timestamps - BOOLEAN
 	Enable timestamps as defined in RFC1323.
 
+tcp_min_tso_segs - INTEGER
+	Minimal number of segments per TCP TSO frame.
+	Since linux-3.12, TCP does an automatic sizing of TSO frames,
+	depending on flow rate, instead of filling 64Kbytes packets.
+	For specific usages, it's possible to force TCP to build big
+	TSO frames. Note that TCP stack might split too big TSO packets
+	if available congestion window is too small.
+	Default: 2
+
 tcp_tso_win_divisor - INTEGER
 	This allows control over what percentage of the congestion window
 	can be consumed by a single TSO frame.
diff --git a/include/net/sock.h b/include/net/sock.h
index e4bbcbf..6ba2e7b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -232,6 +232,7 @@ struct cg_proto;
   *	@sk_napi_id: id of the last napi context to receive data for sk
   *	@sk_ll_usec: usecs to busypoll when there is no data
   *	@sk_allocation: allocation mode
+  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
   *	@sk_sndbuf: size of send buffer in bytes
   *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -361,6 +362,7 @@ struct sock {
 	kmemcheck_bitfield_end(flags);
 	int			sk_wmem_queued;
 	gfp_t			sk_allocation;
+	u32			sk_pacing_rate; /* bytes per second */
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
 	int			sk_gso_type;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 09cb5c1..73fcd7c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -281,6 +281,7 @@ extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
 extern unsigned int sysctl_tcp_notsent_lowat;
+extern int sysctl_tcp_min_tso_segs;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 8ed7c32..540279f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -29,6 +29,7 @@
 static int zero;
 static int one = 1;
 static int four = 4;
+static int gso_max_segs = GSO_MAX_SEGS;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -761,6 +762,15 @@ static struct ctl_table ipv4_table[] = {
 		.extra2		= &four,
 	},
 	{
+		.procname	= "tcp_min_tso_segs",
+		.data		= &sysctl_tcp_min_tso_segs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &gso_max_segs,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ab64eea..e1714ee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -283,6 +283,8 @@
 
 int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 
+int sysctl_tcp_min_tso_segs __read_mostly = 2;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -785,12 +787,28 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	xmit_size_goal = mss_now;
 
 	if (large_allowed && sk_can_gso(sk)) {
-		xmit_size_goal = ((sk->sk_gso_max_size - 1) -
-				  inet_csk(sk)->icsk_af_ops->net_header_len -
-				  inet_csk(sk)->icsk_ext_hdr_len -
-				  tp->tcp_header_len);
+		u32 gso_size, hlen;
+
+		/* Maybe we should/could use sk->sk_prot->max_header here ? */
+		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+		       inet_csk(sk)->icsk_ext_hdr_len +
+		       tp->tcp_header_len;
+
+		/* Goal is to send at least one packet per ms,
+		 * not one big TSO packet every 100 ms.
+		 * This preserves ACK clocking and is consistent
+		 * with tcp_tso_should_defer() heuristic.
+		 */
+		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
+		gso_size = max_t(u32, gso_size,
+				 sysctl_tcp_min_tso_segs * mss_now);
+
+		xmit_size_goal = min_t(u32, gso_size,
+				       sk->sk_gso_max_size - 1 - hlen);
 
-		/* TSQ : try to have two TSO segments in flight */
+		/* TSQ : try to have at least two segments in flight
+		 * (one in NIC TX ring, another in Qdisc)
+		 */
 		xmit_size_goal = min_t(u32, xmit_size_goal,
 				       sysctl_tcp_limit_output_bytes >> 1);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec492ea..0885502 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -629,6 +629,7 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	long m = mrtt; /* RTT */
+	u64 rate;
 
 	/*	The following amusing code comes from Jacobson's
 	 *	article in SIGCOMM '88.  Note that rtt and mdev
@@ -686,6 +687,22 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
 		tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
 		tp->rtt_seq = tp->snd_nxt;
 	}
+
+	/* Pacing: -> set sk_pacing_rate to 200 % of current rate */
+	rate = (u64)tp->mss_cache * 8 * 2 * USEC_PER_SEC;
+	rate *= max(tp->snd_cwnd, tp->packets_out);
+
+	do_div(rate, jiffies_to_usecs(tp->srtt));
+	/* Correction for small srtt : minimum srtt being 8 (1 ms),
+	 * be conservative and assume rtt = 125 us instead of 1 ms
+	 * We probably need usec resolution in the future.
+	 */
+	if (tp->srtt <= 8 + 2)
+		rate <<= 3;
+	sk->sk_pacing_rate = min_t(u64, rate, ~0U);
+	pr_debug("cwnd %u packets_out %u srtt %u -> rate = %llu bits\n",
+		 tp->snd_cwnd, tp->packets_out,
+		 jiffies_to_usecs(tp->srtt) >> 3, rate << 3);
 }
 
 /* Calculate rto without backoff.  This is the second half of Van Jacobson's
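
The rate update added to tcp_rtt_estimator() in the diff above follows
sk_pacing_rate = 2 * cwnd * mss / srtt. The sketch below models that formula
in user space under simplifying assumptions: pacing_rate_bytes_per_sec() is
a hypothetical name, the RTT is passed as plain microseconds (the patch
multiplies by 8 because tp->srtt is stored left-shifted by 3), and the
correction for very small srtt values is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000u

/* Hypothetical model of the pacing rate update: twice the current rate,
 * in bytes per second, saturated to 32 bits like sk_pacing_rate.
 */
static uint32_t pacing_rate_bytes_per_sec(uint32_t mss, uint32_t cwnd,
					  uint32_t srtt_us)
{
	uint64_t rate = (uint64_t)mss * 2 * USEC_PER_SEC;

	assert(srtt_us > 0);

	rate *= cwnd;		/* bytes in one window, doubled, scaled to usec */
	rate /= srtt_us;	/* per smoothed RTT -> bytes per second */

	return rate > UINT32_MAX ? UINT32_MAX : (uint32_t)rate;
}
```

For example, mss = 1448, cwnd = 10 and srtt = 50 ms give 579,200 bytes per
second, which is about 289 bytes per half millisecond, so the
tcp_min_tso_segs floor decides the TSO packet size for such a flow.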


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-24  0:29 [PATCH net-next] tcp: TSO packets automatic sizing Eric Dumazet
@ 2013-08-24  3:17 ` Neal Cardwell
  2013-08-24 18:56   ` Eric Dumazet
  2013-08-25  2:46 ` David Miller
  2013-08-26  4:26 ` [PATCH v2 " Eric Dumazet
  2 siblings, 1 reply; 25+ messages in thread
From: Neal Cardwell @ 2013-08-24  3:17 UTC
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Van Jacobson, Tom Herbert

On Fri, Aug 23, 2013 at 8:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> After hearing many people over past years complaining against TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.
>
> One part of the problem is that tcp_tso_should_defer() uses an heuristic
> relying on upcoming ACKS instead of a timer, but more generally, having
> big TSO packets makes little sense for low rates, as it tends to create
> micro bursts on the network, and general consensus is to reduce the
> buffering amount.
>
> This patch introduces a per socket sk_pacing_rate, that approximates
> the current sending rate, and allows us to size the TSO packets so
> that we try to send one packet every ms.
>
> This field could be set by other transports.
>
> Patch has no impact for high speed flows, where having large TSO packets
> makes sense to reach line rate.
>
> For other flows, this helps better packet scheduling and ACK clocking.
>
> This patch increases performance of TCP flows in lossy environments.
>
> A new sysctl (tcp_min_tso_segs) is added, to specify the
> minimal size of a TSO packet (default being 2).
>
> A follow-up patch will provide a new packet scheduler (FQ), using
> sk_pacing_rate as an input to perform optional per flow pacing.
>
> This explains why we chose to set sk_pacing_rate to twice the current
> rate, allowing 'slow start' ramp up.
>
> sk_pacing_rate = 2 * cwnd * mss / srtt
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Van Jacobson <vanj@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---

I love this! Can't wait to play with it.

Rather than implicitly initializing sk_pacing_rate to 0, I'd suggest
maybe initializing sk_pacing_rate to a value just high enough
(TCP_INIT_CWND * mss / 1ms?) so that in the first transmit the
connection can (as it does today) construct a single TSO jumbogram of
TCP_INIT_CWND segments and send that in a single trip down through the
stack. Hopefully this should keep CPU usage advantages of TSO for
servers that spend most of their time sending replies that are 10MSS
or less, while not making the on-the-wire behavior much burstier than
it would be with the patch as it stands.
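
As a rough illustration of this suggestion, the arithmetic can be checked in
user space. This is a sketch under assumptions: initial_pacing_rate() and
first_tso_goal() are hypothetical helpers, and the factor of 2 is there only
because the posted patch divides sk_pacing_rate by 2 * MSEC_PER_SEC when
sizing TSO packets.

```c
#include <assert.h>
#include <stdint.h>

#define MSEC_PER_SEC 1000u

/* Hypothetical: initial rate (bytes per second) high enough that the first
 * size goal equals init_cwnd full-sized segments.
 */
static uint32_t initial_pacing_rate(uint32_t init_cwnd, uint32_t mss)
{
	uint64_t rate = (uint64_t)init_cwnd * mss * 2 * MSEC_PER_SEC;

	assert(init_cwnd > 0 && mss > 0);

	return rate > UINT32_MAX ? UINT32_MAX : (uint32_t)rate;
}

/* Size goal the patch derives from a given pacing rate (before clamping). */
static uint32_t first_tso_goal(uint32_t pacing_rate)
{
	return pacing_rate / (2 * MSEC_PER_SEC);
}
```

With init_cwnd = 10 and mss = 1460 this gives 29,200,000 bytes per second
and a first size goal of 14600 bytes. Taking "TCP_INIT_CWND * mss / 1ms"
literally (14,600,000 bytes per second) would yield a 7300-byte goal instead.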

I am wondering about the aspect of the patch that sets sk_pacing_rate
to 2x the current rate in tcp_rtt_estimator and then just has to
divide by 2 again in tcp_xmit_size_goal(). It seems the 2x factor is
natural in the packet scheduler context, but at first glance it feels
to me like the multiplication by 2 should be an internal detail of the
optional scheduler, not part of the sk_pacing_rate interface between
the TCP and scheduling layer.

One thing I noticed: something about how the current patch shakes out
causes a basic 10-MSS transfer to take an extra RTT, due to the last
2-segment packet having to wait for an ACK:

# cat iw10-base-case.pkt
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 write(4, ..., 14600) = 14600
0.300 < . 1:1(0) ack 11681 win 257

->

# ./packetdrill iw10-base-case.pkt
0.701287 cli > srv: S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.701367 srv > cli: S 2822928622:2822928622(0) ack 1 win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 6>
0.801276 cli > srv: . ack 1 win 257
0.801365 srv > cli: . 1:2921(2920) ack 1 win 457
0.801376 srv > cli: . 2921:5841(2920) ack 1 win 457
0.801382 srv > cli: . 5841:8761(2920) ack 1 win 457
0.801386 srv > cli: . 8761:11681(2920) ack 1 win 457
0.901284 cli > srv: . ack 11681 win 257
0.901308 srv > cli: P 11681:14601(2920) ack 1 win 457

I'd try to isolate the exact cause, but it's a bit late in the evening
for me to track this down at this point, and I'll be offline tomorrow.

Thanks again. I love this...

cheers,
neal


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-24  3:17 ` Neal Cardwell
@ 2013-08-24 18:56   ` Eric Dumazet
  2013-08-24 20:28     ` Eric Dumazet
  2013-08-25 22:01     ` Yuchung Cheng
  0 siblings, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-24 18:56 UTC
  To: Neal Cardwell
  Cc: David Miller, netdev, Yuchung Cheng, Van Jacobson, Tom Herbert

On Fri, 2013-08-23 at 23:17 -0400, Neal Cardwell wrote:

> I love this! Can't wait to play with it.
> 

Totally agree ;)

> Rather than implicitly initializing sk_pacing_rate to 0, I'd suggest
> maybe initializing sk_pacing_rate to a value just high enough
> (TCP_INIT_CWND * mss / 1ms?) so that in the first transmit the
> connection can (as it does today) construct a single TSO jumbogram of
> TCP_INIT_CWND segments and send that in a single trip down through the
> stack. Hopefully this should keep CPU usage advantages of TSO for
> servers that spend most of their time sending replies that are 10MSS
> or less, while not making the on-the-wire behavior much burstier than
> it would be with the patch as it stands.
> 

Yes, this sounds like an interesting idea.

The problem is that if the application does a sendmsg() of 1 Mbyte right
after accept(), we'll cook 14KB TSO packets and be back to the initial
problem.

Quite frankly, the TSO advantage for servers sending replies that are 10MSS
or less is thin, because we spend most of the CPU cycles in socket
setup/dismantle and ACK processing.

TSO is a win for sockets sending, say, more than 100KB, or even 1MB.



> I am wondering about the aspect of the patch that sets sk_pacing_rate
> to 2x the current rate in tcp_rtt_estimator and then just has to
> divide by 2 again in tcp_xmit_size_goal(). It seems the 2x factor is
> natural in the packet scheduler context, but at first glance it feels
> to me like the multiplication by 2 should be an internal detail of the
> optional scheduler, not part of the sk_pacing_rate interface between
> the TCP and scheduling layer.

I would like to keep FQ as simple as possible, and let the transport
decide on the appropriate strategy.

TCP should be the appropriate place to decide on precise delays between
packets. Packet scheduler will only execute the orders coming from TCP.

In this patch, I chose a 200% factor that is conservative enough to make
sure there will be no change in the ramp up. It can later be changed to
get finer control.

> 
> One thing I noticed: something about how the current patch shakes out
> causes a basic 10-MSS transfer to take an extra RTT, due to the last
> 2-segment packet having to wait for an ACK:
> 
> # cat iw10-base-case.pkt
> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 0.000 bind(3, ..., ...) = 0
> 0.000 listen(3, 1) = 0
> 
> 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> 
> 0.200 write(4, ..., 14600) = 14600
> 0.300 < . 1:1(0) ack 11681 win 257
> 
> ->
> 
> # ./packetdrill iw10-base-case.pkt
> 0.701287 cli > srv: S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.701367 srv > cli: S 2822928622:2822928622(0) ack 1 win 29200 <mss
> 1460,nop,nop,sackOK,nop,wscale 6>
> 0.801276 cli > srv: . ack 1 win 257
> 0.801365 srv > cli: . 1:2921(2920) ack 1 win 457
> 0.801376 srv > cli: . 2921:5841(2920) ack 1 win 457
> 0.801382 srv > cli: . 5841:8761(2920) ack 1 win 457
> 0.801386 srv > cli: . 8761:11681(2920) ack 1 win 457
> 0.901284 cli > srv: . ack 11681 win 257
> 0.901308 srv > cli: P 11681:14601(2920) ack 1 win 457
> 
> I'd try to isolate the exact cause, but it's a bit late in the evening
> for me to track this down at this point, and I'll be offline tomorrow.

Interesting, but I do not see this on a normal Ethernet device (bnx2x in
the following traces).

Trying different min_tso_segs values exhibits the expected behavior: the
first 10 MSS (14480 bytes of payload) are sent in the same ms, with no need
to wait for an ACK. (RTT = 50 ms in this setup.)
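
The packet sizes in the first flight of each trace in this message are
consistent with cutting the 14480-byte initial window into chunks of
tcp_min_tso_segs * 1448 bytes. The sketch below reproduces that split in
user space; split_initial_window() is a hypothetical helper, and it assumes
the size goal is decided purely by the tcp_min_tso_segs floor at this point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: cut window_bytes into packets of at most size_goal bytes.
 * Stores the packet sizes in out[] and returns how many were produced.
 */
static size_t split_initial_window(uint32_t window_bytes, uint32_t size_goal,
				   uint32_t *out, size_t out_len)
{
	size_t n = 0;

	assert(size_goal > 0 && out != NULL);

	while (window_bytes > 0 && n < out_len) {
		uint32_t chunk = window_bytes < size_goal ?
				 window_bytes : size_goal;

		out[n++] = chunk;
		window_bytes -= chunk;
	}
	return n;
}
```

Calling it with a 14480-byte window and a size goal of 3 * 1448 yields
4344, 4344, 4344 and 1448 bytes, the same sizes as in the
tcp_min_tso_segs = 3 trace; a goal of 6 * 1448 yields 8688 and 5792 bytes.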

echo 1 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:40:35.333703 IP 10.246.17.83.50336 > 10.246.17.84.50267: S 3924987356:3924987356(0) win 29200 <mss 1460,sackOK,timestamp 64807623 0,nop,wscale 6>
10:40:35.383835 IP 10.246.17.84.50267 > 10.246.17.83.50336: S 151800535:151800535(0) ack 3924987357 win 28960 <mss 1460,sackOK,timestamp 137049930 64807623,nop,wscale 7>
10:40:35.383868 IP 10.246.17.83.50336 > 10.246.17.84.50267: . ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383936 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 1:1449(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383943 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 1449:2897(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383948 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 2897:4345(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383952 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 4345:5793(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383957 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 5793:7241(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383961 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 7241:8689(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383965 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 8689:10137(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383968 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 10137:11585(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383972 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 11585:13033(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.383975 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 13033:14481(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
10:40:35.434061 IP 10.246.17.84.50267 > 10.246.17.83.50336: . ack 1449 win 249 <nop,nop,timestamp 137049981 64807673>

echo 2 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:45:24.280183 IP 10.246.17.83.36666 > 10.246.17.84.40648: S 1657754774:1657754774(0) win 29200 <mss 1460,sackOK,timestamp 65096569 0,nop,wscale 6>
10:45:24.330302 IP 10.246.17.84.40648 > 10.246.17.83.36666: S 362153932:362153932(0) ack 1657754775 win 28960 <mss 1460,sackOK,timestamp 137338877 65096569,nop,wscale 7>
10:45:24.330384 IP 10.246.17.83.36666 > 10.246.17.84.40648: . ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
10:45:24.330477 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 1:2897(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
10:45:24.330497 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 2897:5793(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
10:45:24.330501 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 5793:8689(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
10:45:24.330665 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 8689:11585(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
10:45:24.330674 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 11585:14481(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
10:45:24.380592 IP 10.246.17.84.40648 > 10.246.17.83.36666: . ack 1449 win 249 <nop,nop,timestamp 137338927 65096620>

echo 3 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:48:51.558662 IP 10.246.17.83.44835 > 10.246.17.84.56145: S 2572155347:2572155347(0) win 29200 <mss 1460,sackOK,timestamp 65303848 0,nop,wscale 6>
10:48:51.608797 IP 10.246.17.84.56145 > 10.246.17.83.44835: S 2206641454:2206641454(0) ack 2572155348 win 28960 <mss 1460,sackOK,timestamp 137546155 65303848,nop,wscale 7>
10:48:51.608824 IP 10.246.17.83.44835 > 10.246.17.84.56145: . ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
10:48:51.608901 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 1:4345(4344) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
10:48:51.608911 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 4345:8689(4344) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
10:48:51.608917 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 8689:13033(4344) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
10:48:51.608927 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 13033:14481(1448) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
10:48:51.659018 IP 10.246.17.84.56145 > 10.246.17.83.44835: . ack 1449 win 249 <nop,nop,timestamp 137546206 65303898>
10:48:51.659102 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65303948 137546206>
10:48:51.659019 IP 10.246.17.84.56145 > 10.246.17.83.44835: . ack 2897 win 272 <nop,nop,timestamp 137546206 65303898>
10:48:51.659113 IP 10.246.17.83.44835 > 10.246.17.84.56145: P 17377:18825(1448) ack 1 win 457 <nop,nop,timestamp 65303948 137546206>
10:48:51.659124 IP 10.246.17.84.56145 > 10.246.17.83.44835: . ack 4345 win 295 <nop,nop,timestamp 137546206 65303898>

echo 4 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:49:41.553016 IP 10.246.17.83.51499 > 10.246.17.84.37071: S 770187706:770187706(0) win 29200 <mss 1460,sackOK,timestamp 65353842 0,nop,wscale 6>
10:49:41.603149 IP 10.246.17.84.37071 > 10.246.17.83.51499: S 3342827191:3342827191(0) ack 770187707 win 28960 <mss 1460,sackOK,timestamp 137596150 65353842,nop,wscale 7>
10:49:41.603223 IP 10.246.17.83.51499 > 10.246.17.84.37071: . ack 1 win 457 <nop,nop,timestamp 65353892 137596150>
10:49:41.603307 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 1:5793(5792) ack 1 win 457 <nop,nop,timestamp 65353893 137596150>
10:49:41.603317 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 5793:11585(5792) ack 1 win 457 <nop,nop,timestamp 65353893 137596150>
10:49:41.603329 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 11585:14481(2896) ack 1 win 457 <nop,nop,timestamp 65353893 137596150>
10:49:41.653448 IP 10.246.17.84.37071 > 10.246.17.83.51499: . ack 1449 win 249 <nop,nop,timestamp 137596200 65353893>
10:49:41.653531 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65353943 137596200>
10:49:41.653450 IP 10.246.17.84.37071 > 10.246.17.83.51499: . ack 2897 win 272 <nop,nop,timestamp 137596200 65353893>
10:49:41.653618 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 17377:20273(2896) ack 1 win 457 <nop,nop,timestamp 65353943 137596200>

echo 5 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:50:33.626270 IP 10.246.17.83.52633 > 10.246.17.84.33693: S 1635294551:1635294551(0) win 29200 <mss 1460,sackOK,timestamp 65405916 0,nop,wscale 6>
10:50:33.676407 IP 10.246.17.84.33693 > 10.246.17.83.52633: S 1023650170:1023650170(0) ack 1635294552 win 28960 <mss 1460,sackOK,timestamp 137648223 65405916,nop,wscale 7>
10:50:33.676489 IP 10.246.17.83.52633 > 10.246.17.84.33693: . ack 1 win 457 <nop,nop,timestamp 65405966 137648223>
10:50:33.676571 IP 10.246.17.83.52633 > 10.246.17.84.33693: . 1:7241(7240) ack 1 win 457 <nop,nop,timestamp 65405966 137648223>
10:50:33.676578 IP 10.246.17.83.52633 > 10.246.17.84.33693: . 7241:14481(7240) ack 1 win 457 <nop,nop,timestamp 65405966 137648223>
10:50:33.726706 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 1449 win 249 <nop,nop,timestamp 137648273 65405966>
10:50:33.726707 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 2897 win 272 <nop,nop,timestamp 137648273 65405966>
10:50:33.726792 IP 10.246.17.83.52633 > 10.246.17.84.33693: . 14481:20273(5792) ack 1 win 457 <nop,nop,timestamp 65406016 137648273>
10:50:33.726781 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 4345 win 295 <nop,nop,timestamp 137648273 65405966>
10:50:33.726986 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 5793 win 317 <nop,nop,timestamp 137648274 65405966>
10:50:33.727101 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 7241 win 340 <nop,nop,timestamp 137648274 65405966>
10:50:33.727117 IP 10.246.17.83.52633 > 10.246.17.84.33693: P 20273:27513(7240) ack 1 win 457 <nop,nop,timestamp 65406016 137648274>
10:50:33.727258 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 8689 win 340 <nop,nop,timestamp 137648274 65405966>
10:50:33.727408 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 10137 win 340 <nop,nop,timestamp 137648274 65405966>

echo 6 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:51:23.295063 IP 10.246.17.83.49096 > 10.246.17.84.43872: S 1841824181:1841824181(0) win 29200 <mss 1460,sackOK,timestamp 65455584 0,nop,wscale 6>
10:51:23.345207 IP 10.246.17.84.43872 > 10.246.17.83.49096: S 2837501410:2837501410(0) ack 1841824182 win 28960 <mss 1460,sackOK,timestamp 137697892 65455584,nop,wscale 7>
10:51:23.345237 IP 10.246.17.83.49096 > 10.246.17.84.43872: . ack 1 win 457 <nop,nop,timestamp 65455635 137697892>
10:51:23.345311 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 1:8689(8688) ack 1 win 457 <nop,nop,timestamp 65455635 137697892>
10:51:23.345330 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 8689:14481(5792) ack 1 win 457 <nop,nop,timestamp 65455635 137697892>
10:51:23.395453 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 1449 win 249 <nop,nop,timestamp 137697942 65455635>
10:51:23.395454 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 2897 win 272 <nop,nop,timestamp 137697942 65455635>
10:51:23.395544 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 14481:20273(5792) ack 1 win 457 <nop,nop,timestamp 65455685 137697942>
10:51:23.395533 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 4345 win 295 <nop,nop,timestamp 137697942 65455635>
10:51:23.395631 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 20273:23169(2896) ack 1 win 457 <nop,nop,timestamp 65455685 137697942>
10:51:23.395746 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 5793 win 317 <nop,nop,timestamp 137697942 65455635>
10:51:23.395854 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 7241 win 340 <nop,nop,timestamp 137697943 65455635>
10:51:23.396049 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 8689 win 340 <nop,nop,timestamp 137697943 65455635>
10:51:23.396199 IP 10.246.17.83.49096 > 10.246.17.84.43872: P 23169:31857(8688) ack 1 win 457 <nop,nop,timestamp 65455685 137697943>

echo 7 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:51:58.219334 IP 10.246.17.83.58882 > 10.246.17.84.41983: S 3763353310:3763353310(0) win 29200 <mss 1460,sackOK,timestamp 65490509 0,nop,wscale 6>
10:51:58.269455 IP 10.246.17.84.41983 > 10.246.17.83.58882: S 1445588492:1445588492(0) ack 3763353311 win 28960 <mss 1460,sackOK,timestamp 137732816 65490509,nop,wscale 7>
10:51:58.269536 IP 10.246.17.83.58882 > 10.246.17.84.41983: . ack 1 win 457 <nop,nop,timestamp 65490559 137732816>
10:51:58.269634 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 1:10137(10136) ack 1 win 457 <nop,nop,timestamp 65490559 137732816>
10:51:58.269646 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 10137:14481(4344) ack 1 win 457 <nop,nop,timestamp 65490559 137732816>
10:51:58.319765 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 1449 win 249 <nop,nop,timestamp 137732866 65490559>
10:51:58.319846 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65490609 137732866>
10:51:58.319767 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 2897 win 272 <nop,nop,timestamp 137732866 65490559>
10:51:58.319843 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 4345 win 295 <nop,nop,timestamp 137732867 65490559>
10:51:58.319911 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 17377:23169(5792) ack 1 win 457 <nop,nop,timestamp 65490609 137732867>
10:51:58.320068 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 5793 win 317 <nop,nop,timestamp 137732867 65490559>
10:51:58.320180 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 7241 win 340 <nop,nop,timestamp 137732867 65490559>
10:51:58.320287 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 8689 win 340 <nop,nop,timestamp 137732867 65490559>
10:51:58.320295 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 23169:31857(8688) ack 1 win 457 <nop,nop,timestamp 65490610 137732867>
10:51:58.320496 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 10137 win 340 <nop,nop,timestamp 137732867 65490559>
10:51:58.320513 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 31857:33305(1448) ack 1 win 457 <nop,nop,timestamp 65490610 137732867>

echo 8 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:52:50.398941 IP 10.246.17.83.32908 > 10.246.17.84.65099: S 678482142:678482142(0) win 29200 <mss 1460,sackOK,timestamp 65542688 0,nop,wscale 6>
10:52:50.449061 IP 10.246.17.84.65099 > 10.246.17.83.32908: S 3229813359:3229813359(0) ack 678482143 win 28960 <mss 1460,sackOK,timestamp 137784996 65542688,nop,wscale 7>
10:52:50.449146 IP 10.246.17.83.32908 > 10.246.17.84.65099: . ack 1 win 457 <nop,nop,timestamp 65542738 137784996>
10:52:50.449258 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 1:11585(11584) ack 1 win 457 <nop,nop,timestamp 65542739 137784996>
10:52:50.449384 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 11585:14481(2896) ack 1 win 457 <nop,nop,timestamp 65542739 137784996>
10:52:50.499379 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 1449 win 249 <nop,nop,timestamp 137785046 65542739>
10:52:50.499462 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65542789 137785046>
10:52:50.499381 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 2897 win 272 <nop,nop,timestamp 137785046 65542739>
10:52:50.499552 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 17377:20273(2896) ack 1 win 457 <nop,nop,timestamp 65542789 137785046>
10:52:50.499552 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 4345 win 295 <nop,nop,timestamp 137785046 65542739>
10:52:50.499661 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 5793 win 317 <nop,nop,timestamp 137785046 65542739>
10:52:50.499806 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 7241 win 340 <nop,nop,timestamp 137785046 65542739>
10:52:50.499845 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 20273:28961(8688) ack 1 win 457 <nop,nop,timestamp 65542789 137785046>
10:52:50.500006 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 8689 win 340 <nop,nop,timestamp 137785047 65542739>

echo 9 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:53:31.504788 IP 10.246.17.83.59687 > 10.246.17.84.38716: S 1238515537:1238515537(0) win 29200 <mss 1460,sackOK,timestamp 65583794 0,nop,wscale 6>
10:53:31.554898 IP 10.246.17.84.38716 > 10.246.17.83.59687: S 667062900:667062900(0) ack 1238515538 win 28960 <mss 1460,sackOK,timestamp 137826102 65583794,nop,wscale 7>
10:53:31.554973 IP 10.246.17.83.59687 > 10.246.17.84.38716: . ack 1 win 457 <nop,nop,timestamp 65583844 137826102>
10:53:31.555050 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 1:13033(13032) ack 1 win 457 <nop,nop,timestamp 65583844 137826102>
10:53:31.555072 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 13033:14481(1448) ack 1 win 457 <nop,nop,timestamp 65583844 137826102>
10:53:31.605154 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 1449 win 249 <nop,nop,timestamp 137826152 65583844>
10:53:31.605235 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65583895 137826152>
10:53:31.605156 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 2897 win 272 <nop,nop,timestamp 137826152 65583844>
10:53:31.605293 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 4345 win 295 <nop,nop,timestamp 137826152 65583844>
10:53:31.605325 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 17377:23169(5792) ack 1 win 457 <nop,nop,timestamp 65583895 137826152>
10:53:31.605461 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 5793 win 317 <nop,nop,timestamp 137826152 65583844>
10:53:31.605599 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 7241 win 340 <nop,nop,timestamp 137826152 65583844>
10:53:31.605750 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 8689 win 340 <nop,nop,timestamp 137826152 65583844>
10:53:31.605834 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 23169:31857(8688) ack 1 win 457 <nop,nop,timestamp 65583895 137826152>
10:53:31.605899 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 10137 win 340 <nop,nop,timestamp 137826153 65583844>
10:53:31.606055 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 11585 win 340 <nop,nop,timestamp 137826153 65583844>
10:53:31.606155 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 31857:36201(4344) ack 1 win 457 <nop,nop,timestamp 65583895 137826153>
10:53:31.606157 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 13033 win 340 <nop,nop,timestamp 137826153 65583844>

echo 10 >/proc/sys/net/ipv4/tcp_min_tso_segs

10:54:15.974831 IP 10.246.17.83.53733 > 10.246.17.84.34163: S 690526362:690526362(0) win 29200 <mss 1460,sackOK,timestamp 65628264 0,nop,wscale 6>
10:54:16.024978 IP 10.246.17.84.34163 > 10.246.17.83.53733: S 1914393851:1914393851(0) ack 690526363 win 28960 <mss 1460,sackOK,timestamp 137870572 65628264,nop,wscale 7>
10:54:16.025047 IP 10.246.17.83.53733 > 10.246.17.84.34163: . ack 1 win 457 <nop,nop,timestamp 65628314 137870572>
10:54:16.025132 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 1:14481(14480) ack 1 win 457 <nop,nop,timestamp 65628314 137870572>
10:54:16.075247 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 1449 win 249 <nop,nop,timestamp 137870622 65628314>
10:54:16.075249 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 2897 win 272 <nop,nop,timestamp 137870622 65628314>
10:54:16.075334 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 14481:20273(5792) ack 1 win 457 <nop,nop,timestamp 65628365 137870622>
10:54:16.075452 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 4345 win 295 <nop,nop,timestamp 137870622 65628314>
10:54:16.075570 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 5793 win 317 <nop,nop,timestamp 137870622 65628314>
10:54:16.075674 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 7241 win 340 <nop,nop,timestamp 137870622 65628314>
10:54:16.075698 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 20273:28961(8688) ack 1 win 457 <nop,nop,timestamp 65628365 137870622>
10:54:16.075833 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 8689 win 340 <nop,nop,timestamp 137870622 65628314>
10:54:16.075990 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 10137 win 340 <nop,nop,timestamp 137870623 65628314>
10:54:16.076116 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 28961:34753(5792) ack 1 win 457 <nop,nop,timestamp 65628365 137870623>
10:54:16.076096 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 11585 win 340 <nop,nop,timestamp 137870623 65628314>
10:54:16.076291 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 13033 win 340 <nop,nop,timestamp 137870623 65628314>
10:54:16.076435 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 14481 win 340 <nop,nop,timestamp 137870623 65628314>
10:54:16.125492 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 15929 win 340 <nop,nop,timestamp 137870672 65628365>
10:54:16.125569 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 34753:46337(11584) ack 1 win 457 <nop,nop,timestamp 65628415 137870672>
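
A minimal sketch of how the tcp_min_tso_segs floor could produce the
packet sizes seen in these traces. This is a userspace model, not the
patch's code: the helper name tso_goal_bytes() and its parameters are
hypothetical, and the device GSO size limit is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: aim for about 1 ms of data per TSO packet, but
 * never fewer than min_tso_segs full segments.
 * pacing_rate is in bytes per second and is assumed to be twice the
 * current sending rate, as the patch description states.
 */
uint32_t tso_goal_bytes(uint32_t pacing_rate, uint32_t mss,
			uint32_t min_tso_segs)
{
	uint32_t per_ms, floor_bytes, goal;

	assert(mss > 0);
	per_ms = pacing_rate / (2 * 1000);	/* undo the 2x, keep 1 ms worth */
	floor_bytes = min_tso_segs * mss;	/* sysctl lower bound */
	goal = per_ms > floor_bytes ? per_ms : floor_bytes;
	return goal - goal % mss;		/* whole segments only */
}
```

With a pacing rate of 579200 bytes/s (2 * 10 * 1448 bytes per 50 ms) and
mss 1448, one millisecond of data is less than one segment, so the floor
decides: tso_goal_bytes(579200, 1448, 9) is 13032 and
tso_goal_bytes(579200, 1448, 10) is 14480, the sizes of the first data
packets in the two traces above.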


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-24 18:56   ` Eric Dumazet
@ 2013-08-24 20:28     ` Eric Dumazet
  2013-08-25 22:01     ` Yuchung Cheng
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-24 20:28 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Yuchung Cheng, Van Jacobson, Tom Herbert

On Sat, 2013-08-24 at 11:56 -0700, Eric Dumazet wrote:

> Problem is that if the application does a sendmsg( 1 Mbytes) right after
> accept(), we'll cook 14KB TSO packets and are back to initial problem.
> 
> Quite frankly TSO advantage for servers sending replies that are 10MSS
> or less is thin, because we spend most of cpu cycles in socket
> setup/dismantle and ACK processing.
> 
> TSO is a win for sockets sending say more than 100KB, or even 1MB

Another interesting point: with small packets at the beginning of the
connection, when/if pacing is enabled in the (FQ) packet scheduler,
an incorrect initial RTT estimate has a lower impact:

13:14:45.271930 IP 10.246.17.83.41052 > 10.246.17.84.41129: S 2688061178:2688061178(0) win 29200 <mss 1460,sackOK,timestamp 281602 0,nop,wscale 6>
13:14:45.322055 IP 10.246.17.84.41129 > 10.246.17.83.41052: S 1339982632:1339982632(0) ack 2688061179 win 28960 <mss 1460,sackOK,timestamp 146299869 281602,nop,wscale 7>
13:14:45.322126 IP 10.246.17.83.41052 > 10.246.17.84.41129: . ack 1 win 457 <nop,nop,timestamp 281652 146299869>
13:14:45.322245 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 1:1449(1448) ack 1 win 457 <nop,nop,timestamp 281652 146299869>
13:14:45.324944 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 1449:2897(1448) ack 1 win 457 <nop,nop,timestamp 281652 146299869>
13:14:45.327600 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 2897:4345(1448) ack 1 win 457 <nop,nop,timestamp 281652 146299869>
13:14:45.330301 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 4345:5793(1448) ack 1 win 457 <nop,nop,timestamp 281652 146299869>
13:14:45.333001 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 5793:7241(1448) ack 1 win 457 <nop,nop,timestamp 281653 146299869>
13:14:45.335697 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 7241:8689(1448) ack 1 win 457 <nop,nop,timestamp 281653 146299869>
13:14:45.338392 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 8689:10137(1448) ack 1 win 457 <nop,nop,timestamp 281653 146299869>
13:14:45.341087 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 10137:11585(1448) ack 1 win 457 <nop,nop,timestamp 281653 146299869>
13:14:45.343770 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 11585:13033(1448) ack 1 win 457 <nop,nop,timestamp 281653 146299869>
13:14:45.346471 IP 10.246.17.83.41052 > 10.246.17.84.41129: . 13033:14481(1448) ack 1 win 457 <nop,nop,timestamp 281653 146299869>
13:14:45.372577 IP 10.246.17.84.41129 > 10.246.17.83.41052: . ack 1449 win 249 <nop,nop,timestamp 146299919 281652>

If the "ack 1449" from the client came back sooner than expected,
this could change the srtt estimate, and the packet scheduler could
send the remaining packets sooner.
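
A rough check of the spacing in the trace above, assuming the scheduler
paces at sk_pacing_rate = 2 * cwnd * mss / srtt with cwnd = 10,
mss = 1448 and srtt = 50 ms (values read off the trace, not reported by
the scheduler). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, in bytes per second.
 * srtt_us is the smoothed RTT in microseconds, unscaled.
 */
uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
				   uint32_t srtt_us)
{
	assert(srtt_us > 0);
	return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/* Microseconds between departures of packets of pkt_len bytes when
 * sending at rate bytes per second.
 */
uint64_t pacing_gap_us(uint64_t rate, uint32_t pkt_len)
{
	assert(rate > 0);
	return (uint64_t)pkt_len * 1000000ULL / rate;
}
```

pacing_rate_bytes_per_sec(10, 1448, 50000) is 579200 bytes/s and
pacing_gap_us(579200, 1448) is 2500 us, counting payload only. That is in
the same range as the ~2.7 ms gaps in the trace. If an early ACK halved
srtt to 25 ms, the gap would halve to 1250 us.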

This makes me think that srtt computation could be more precise.

First RTT sample sets SRTT=RTT

But second sample sets SRTT = SRTT*7/8 + nRTT/8,
while it probably should do SRTT = (SRTT + nRTT)/2

Third sample also should do: SRTT = SRTT*2/3 + nRTT/3
...
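
A minimal sketch of that idea, assuming the "..." means the n-th sample
gets weight 1/n until the usual 1/8 gain is reached. The struct and
function names are hypothetical, and this is plain userspace arithmetic,
not the kernel's scaled srtt representation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct srtt_state {
	uint32_t srtt_us;	/* smoothed RTT, microseconds */
	uint32_t samples;	/* RTT samples seen so far, capped at 8 */
};

/* Feed one RTT sample. Sample n is weighted 1/n for n = 1..8, so the
 * first samples form a plain running mean; from the 8th sample on this
 * is the classic SRTT = SRTT*7/8 + nRTT/8.
 */
void srtt_update(struct srtt_state *s, uint32_t rtt_us)
{
	uint32_t n;

	assert(s != NULL);
	if (s->samples < 8)
		s->samples++;
	n = s->samples;
	/* srtt = srtt * (n - 1) / n + rtt / n, done in 64 bits */
	s->srtt_us = (uint32_t)(((uint64_t)s->srtt_us * (n - 1) + rtt_us) / n);
}
```

Starting from a zeroed state, samples of 100, 50 and 60 us give an SRTT
of 100, then 75, then 70. With the fixed 1/8 gain the second sample would
only move the estimate from 100 to about 94.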


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-24 18:56   ` Eric Dumazet
  2013-08-24 20:28     ` Eric Dumazet
@ 2013-08-25 22:01     ` Yuchung Cheng
  2013-08-26  0:37       ` Eric Dumazet
  1 sibling, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2013-08-25 22:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, David Miller, netdev, Van Jacobson, Tom Herbert

On Sat, Aug 24, 2013 at 11:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On Fri, 2013-08-23 at 23:17 -0400, Neal Cardwell wrote:
>
> > I love this! Can't wait to play with it.
> >
>
> Totally agree ;)
>
> > Rather than implicitly initializing sk_pacing_rate to 0, I'd suggest
> > maybe initializing sk_pacing_rate to a value just high enough
> > (TCP_INIT_CWND * mss / 1ms?) so that in the first transmit the
> > connection can (as it does today) construct a single TSO jumbogram of
> > TCP_INIT_CWND segments and send that in a single trip down through the
> > stack. Hopefully this should keep CPU usage advantages of TSO for
> > servers that spend most of their time sending replies that are 10MSS
> > or less, while not making the on-the-wire behavior much burstier than
> > it would be with the patch as it stands.
> >
>
> Yes, this sounds an interesting idea.
>
> Problem is that if the application does a sendmsg( 1 Mbytes) right after
> accept(), we'll cook 14KB TSO packets and are back to initial problem.
>
> Quite frankly TSO advantage for servers sending replies that are 10MSS
> or less is thin, because we spend most of cpu cycles in socket
> setup/dismantle and ACK processing.
>
> TSO is a win for sockets sending say more than 100KB, or even 1MB
>
>
>
> > I am wondering about the aspect of the patch that sets sk_pacing_rate
> > to 2x the current rate in tcp_rtt_estimator and then just has to
> > divide by 2 again in tcp_xmit_size_goal(). It seems the 2x factor is
> > natural in the packet scheduler context, but at first glance it feels
> > to me like the multiplication by 2 should be an internal detail of the
> > optional scheduler, not part of the sk_pacing_rate interface between
> > the TCP and scheduling layer.
>
> I would like to keep FQ as simple as possible, and let the transport
> decide for appropriate strategy.
>
> TCP should be the appropriate place to decide on precise delays between
> packets. Packet scheduler will only execute the orders coming from TCP.
>
> In this patch, I chose a 200% factor that is conservative enough to make
> sure there will be no change in the ramp up. It can later be changed to
> get finer control.
>
> >
> > One thing I noticed: something about how the current patch shakes out
> > causes a basic 10-MSS transfer to take an extra RTT, due to the last
> > 2-segment packet having to wait for an ACK:
> >
> > # cat iw10-base-case.pkt
> > 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > 0.000 bind(3, ..., ...) = 0
> > 0.000 listen(3, 1) = 0
> >
> > 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> > 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
> > 0.200 < . 1:1(0) ack 1 win 257
> > 0.200 accept(3, ..., ...) = 4
> >
> > 0.200 write(4, ..., 14600) = 14600
> > 0.300 < . 1:1(0) ack 11681 win 257
> >
> > ->
> >
> > # ./packetdrill iw10-base-case.pkt
> > 0.701287 cli > srv: S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> > 0.701367 srv > cli: S 2822928622:2822928622(0) ack 1 win 29200 <mss
> > 1460,nop,nop,sackOK,nop,wscale 6>
> > 0.801276 cli > srv: . ack 1 win 257
> > 0.801365 srv > cli: . 1:2921(2920) ack 1 win 457
> > 0.801376 srv > cli: . 2921:5841(2920) ack 1 win 457
> > 0.801382 srv > cli: . 5841:8761(2920) ack 1 win 457
> > 0.801386 srv > cli: . 8761:11681(2920) ack 1 win 457
> > 0.901284 cli > srv: . ack 11681 win 257
> > 0.901308 srv > cli: P 11681:14601(2920) ack 1 win 457
> >
> > I'd try to isolate the exact cause, but it's a bit late in the evening
> > for me to track this down at this point, and I'll be offline tomorrow.
>
> Interesting, but I do not see this on normal ethernet device (bnx2x in
> the following traces)

I suspect the issue is triggered when the write size is between 9 and 10
full-MSS packets, e.g., Neal's packetdrill test writes 10 full-size MSS
of data. I was able to reproduce this with both packetdrill and
a toy socket program on a real network (~62ms RTT, 1430 MSS). Here is
the tcpdump with relative timings (-ttt).

13000 bytes init write size:
20. 948886 IP 10.246.17.76.60429 > srv: S 3733683575:3733683575(0) win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 6>
062381 IP srv > 10.246.17.76.60429: S 871819030:871819030(0) ack 3733683576 win 62920 <mss 1430,nop,nop,sackOK,nop,wscale 6>
000022 IP 10.246.17.76.60429 > srv: . ack 1 win 457
000022 IP 10.246.17.76.60429 > srv: . 1:2861(2860) ack 1 win 457
000009 IP 10.246.17.76.60429 > srv: . 2861:5721(2860) ack 1 win 457
000010 IP 10.246.17.76.60429 > srv: . 5721:8581(2860) ack 1 win 457
000004 IP 10.246.17.76.60429 > srv: . 8581:11441(2860) ack 1 win 457
062604 IP srv > 10.246.17.76.60429: . ack 11441 win 858
000019 IP 10.246.17.76.60429 > srv: . 11441:12871(1430) ack 1 win 457
000004 IP 10.246.17.76.60429 > srv: P 12871:13001(130) ack 1 win 457

14300 bytes init write size:
lpq76:/export/hda3/tmp/gtests/net/tcp# /tmp/pacing srv 14300
22. 467698 IP cli > srv: S 2400920852:2400920852(0) win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 6>
062536 IP srv > cli: S 2816755090:2816755090(0) ack 2400920853 win 62920 <mss 1430,nop,nop,sackOK,nop,wscale 6>
000017 IP cli > srv: . ack 1 win 457
000016 IP cli > srv: . 1:2861(2860) ack 1 win 457
000008 IP cli > srv: . 2861:5721(2860) ack 1 win 457
000013 IP cli > srv: . 5721:8581(2860) ack 1 win 457
000007 IP cli > srv: . 8581:11441(2860) ack 1 win 457
062745 IP srv > cli: . ack 11441 win 858
000013 IP cli > srv: P 11441:14301(2860) ack 1 win 457

Any idea how to get rid of this undesirable extra RTT delay?




Also we probably want to update the rate whenever RTT or cwnd is
updated (i.e., after fastretrans_alert()), and the code really
deserves a separate function since it's a major feature, i.e.:

+/* Set the transmission rate of TSO segs in the packet scheduler to
+ * reduce the bursts created by TCP. Note: this is not the conventional
+ * TCP pacing. TCP is still ack-clocked and window based, but we
+ * smooth the burst on large write when packets in flight is significantly
+ * lower than cwnd (or rwin).
+ */
+static void tcp_update_tso_segs_pacing(struct sock *sk)
+{
+       struct tcp_sock *tp = tcp_sk(sk);
+       /* Pacing: -> set sk_pacing_rate to 200 % of current rate */
+       u64 rate = (u64)tp->mss_cache * 8 * 2 * USEC_PER_SEC;
+
+       rate *= max(tp->snd_cwnd, tp->packets_out);
+       do_div(rate, jiffies_to_usecs(tp->srtt));
+       /* Correction for small srtt : minimum srtt being 8 (1 ms),
+        * be conservative and assume rtt = 125 us instead of 1 ms
+        * We probably need usec resolution in the future.
+        */
+       if (tp->srtt <= 8 + 2)
+               rate <<= 3;
+       sk->sk_pacing_rate = min_t(u64, rate, ~0U);
+       pr_debug("cwnd %u packets_out %u srtt %u -> rate = %llu bits\n",
+                tp->snd_cwnd, tp->packets_out,
+                jiffies_to_usecs(tp->srtt) >> 3, rate << 3);
+}
+
 /* This routine deals with incoming acks, but not outgoing ones. */
 static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 {
@@ -3295,7 +3304,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
        u32 ack_seq = TCP_SKB_CB(skb)->seq;
        u32 ack = TCP_SKB_CB(skb)->ack_seq;
        bool is_dupack = false;
-       u32 prior_in_flight;
+       u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
        u32 prior_fackets;
        int prior_packets = tp->packets_out;
        const int prior_unsacked = tp->packets_out - tp->sacked_out;
@@ -3400,6 +3409,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)

        if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                tcp_schedule_loss_probe(sk);
+
+       if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
+               tcp_update_tso_segs_pacing(sk);
        return 1;

 no_queue:


>
> Trying different min_tso_segs exhibits expected different behavior (10
> first MSS (14480 bytes of payload) sent in the same ms, no need to wait
> an ACK. (RTT = 50ms in this setup)
>
> echo 1 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:40:35.333703 IP 10.246.17.83.50336 > 10.246.17.84.50267: S 3924987356:3924987356(0) win 29200 <mss 1460,sackOK,timestamp 64807623 0,nop,wscale 6>
> 10:40:35.383835 IP 10.246.17.84.50267 > 10.246.17.83.50336: S 151800535:151800535(0) ack 3924987357 win 28960 <mss 1460,sackOK,timestamp 137049930 64807623,nop,wscale 7>
> 10:40:35.383868 IP 10.246.17.83.50336 > 10.246.17.84.50267: . ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383936 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 1:1449(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383943 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 1449:2897(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383948 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 2897:4345(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383952 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 4345:5793(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383957 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 5793:7241(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383961 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 7241:8689(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383965 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 8689:10137(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383968 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 10137:11585(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383972 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 11585:13033(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.383975 IP 10.246.17.83.50336 > 10.246.17.84.50267: . 13033:14481(1448) ack 1 win 457 <nop,nop,timestamp 64807673 137049930>
> 10:40:35.434061 IP 10.246.17.84.50267 > 10.246.17.83.50336: . ack 1449 win 249 <nop,nop,timestamp 137049981 64807673>
>
> echo 2 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:45:24.280183 IP 10.246.17.83.36666 > 10.246.17.84.40648: S 1657754774:1657754774(0) win 29200 <mss 1460,sackOK,timestamp 65096569 0,nop,wscale 6>
> 10:45:24.330302 IP 10.246.17.84.40648 > 10.246.17.83.36666: S 362153932:362153932(0) ack 1657754775 win 28960 <mss 1460,sackOK,timestamp 137338877 65096569,nop,wscale 7>
> 10:45:24.330384 IP 10.246.17.83.36666 > 10.246.17.84.40648: . ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
> 10:45:24.330477 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 1:2897(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
> 10:45:24.330497 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 2897:5793(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
> 10:45:24.330501 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 5793:8689(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
> 10:45:24.330665 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 8689:11585(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
> 10:45:24.330674 IP 10.246.17.83.36666 > 10.246.17.84.40648: . 11585:14481(2896) ack 1 win 457 <nop,nop,timestamp 65096620 137338877>
> 10:45:24.380592 IP 10.246.17.84.40648 > 10.246.17.83.36666: . ack 1449 win 249 <nop,nop,timestamp 137338927 65096620>
>
> echo 3 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:48:51.558662 IP 10.246.17.83.44835 > 10.246.17.84.56145: S 2572155347:2572155347(0) win 29200 <mss 1460,sackOK,timestamp 65303848 0,nop,wscale 6>
> 10:48:51.608797 IP 10.246.17.84.56145 > 10.246.17.83.44835: S 2206641454:2206641454(0) ack 2572155348 win 28960 <mss 1460,sackOK,timestamp 137546155 65303848,nop,wscale 7>
> 10:48:51.608824 IP 10.246.17.83.44835 > 10.246.17.84.56145: . ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
> 10:48:51.608901 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 1:4345(4344) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
> 10:48:51.608911 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 4345:8689(4344) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
> 10:48:51.608917 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 8689:13033(4344) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
> 10:48:51.608927 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 13033:14481(1448) ack 1 win 457 <nop,nop,timestamp 65303898 137546155>
> 10:48:51.659018 IP 10.246.17.84.56145 > 10.246.17.83.44835: . ack 1449 win 249 <nop,nop,timestamp 137546206 65303898>
> 10:48:51.659102 IP 10.246.17.83.44835 > 10.246.17.84.56145: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65303948 137546206>
> 10:48:51.659019 IP 10.246.17.84.56145 > 10.246.17.83.44835: . ack 2897 win 272 <nop,nop,timestamp 137546206 65303898>
> 10:48:51.659113 IP 10.246.17.83.44835 > 10.246.17.84.56145: P 17377:18825(1448) ack 1 win 457 <nop,nop,timestamp 65303948 137546206>
> 10:48:51.659124 IP 10.246.17.84.56145 > 10.246.17.83.44835: . ack 4345 win 295 <nop,nop,timestamp 137546206 65303898>
>
> echo 4 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:49:41.553016 IP 10.246.17.83.51499 > 10.246.17.84.37071: S 770187706:770187706(0) win 29200 <mss 1460,sackOK,timestamp 65353842 0,nop,wscale 6>
> 10:49:41.603149 IP 10.246.17.84.37071 > 10.246.17.83.51499: S 3342827191:3342827191(0) ack 770187707 win 28960 <mss 1460,sackOK,timestamp 137596150 65353842,nop,wscale 7>
> 10:49:41.603223 IP 10.246.17.83.51499 > 10.246.17.84.37071: . ack 1 win 457 <nop,nop,timestamp 65353892 137596150>
> 10:49:41.603307 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 1:5793(5792) ack 1 win 457 <nop,nop,timestamp 65353893 137596150>
> 10:49:41.603317 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 5793:11585(5792) ack 1 win 457 <nop,nop,timestamp 65353893 137596150>
> 10:49:41.603329 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 11585:14481(2896) ack 1 win 457 <nop,nop,timestamp 65353893 137596150>
> 10:49:41.653448 IP 10.246.17.84.37071 > 10.246.17.83.51499: . ack 1449 win 249 <nop,nop,timestamp 137596200 65353893>
> 10:49:41.653531 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65353943 137596200>
> 10:49:41.653450 IP 10.246.17.84.37071 > 10.246.17.83.51499: . ack 2897 win 272 <nop,nop,timestamp 137596200 65353893>
> 10:49:41.653618 IP 10.246.17.83.51499 > 10.246.17.84.37071: . 17377:20273(2896) ack 1 win 457 <nop,nop,timestamp 65353943 137596200>
>
> echo 5 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:50:33.626270 IP 10.246.17.83.52633 > 10.246.17.84.33693: S 1635294551:1635294551(0) win 29200 <mss 1460,sackOK,timestamp 65405916 0,nop,wscale 6>
> 10:50:33.676407 IP 10.246.17.84.33693 > 10.246.17.83.52633: S 1023650170:1023650170(0) ack 1635294552 win 28960 <mss 1460,sackOK,timestamp 137648223 65405916,nop,wscale 7>
> 10:50:33.676489 IP 10.246.17.83.52633 > 10.246.17.84.33693: . ack 1 win 457 <nop,nop,timestamp 65405966 137648223>
> 10:50:33.676571 IP 10.246.17.83.52633 > 10.246.17.84.33693: . 1:7241(7240) ack 1 win 457 <nop,nop,timestamp 65405966 137648223>
> 10:50:33.676578 IP 10.246.17.83.52633 > 10.246.17.84.33693: . 7241:14481(7240) ack 1 win 457 <nop,nop,timestamp 65405966 137648223>
> 10:50:33.726706 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 1449 win 249 <nop,nop,timestamp 137648273 65405966>
> 10:50:33.726707 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 2897 win 272 <nop,nop,timestamp 137648273 65405966>
> 10:50:33.726792 IP 10.246.17.83.52633 > 10.246.17.84.33693: . 14481:20273(5792) ack 1 win 457 <nop,nop,timestamp 65406016 137648273>
> 10:50:33.726781 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 4345 win 295 <nop,nop,timestamp 137648273 65405966>
> 10:50:33.726986 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 5793 win 317 <nop,nop,timestamp 137648274 65405966>
> 10:50:33.727101 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 7241 win 340 <nop,nop,timestamp 137648274 65405966>
> 10:50:33.727117 IP 10.246.17.83.52633 > 10.246.17.84.33693: P 20273:27513(7240) ack 1 win 457 <nop,nop,timestamp 65406016 137648274>
> 10:50:33.727258 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 8689 win 340 <nop,nop,timestamp 137648274 65405966>
> 10:50:33.727408 IP 10.246.17.84.33693 > 10.246.17.83.52633: . ack 10137 win 340 <nop,nop,timestamp 137648274 65405966>
>
> echo 6 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:51:23.295063 IP 10.246.17.83.49096 > 10.246.17.84.43872: S 1841824181:1841824181(0) win 29200 <mss 1460,sackOK,timestamp 65455584 0,nop,wscale 6>
> 10:51:23.345207 IP 10.246.17.84.43872 > 10.246.17.83.49096: S 2837501410:2837501410(0) ack 1841824182 win 28960 <mss 1460,sackOK,timestamp 137697892 65455584,nop,wscale 7>
> 10:51:23.345237 IP 10.246.17.83.49096 > 10.246.17.84.43872: . ack 1 win 457 <nop,nop,timestamp 65455635 137697892>
> 10:51:23.345311 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 1:8689(8688) ack 1 win 457 <nop,nop,timestamp 65455635 137697892>
> 10:51:23.345330 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 8689:14481(5792) ack 1 win 457 <nop,nop,timestamp 65455635 137697892>
> 10:51:23.395453 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 1449 win 249 <nop,nop,timestamp 137697942 65455635>
> 10:51:23.395454 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 2897 win 272 <nop,nop,timestamp 137697942 65455635>
> 10:51:23.395544 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 14481:20273(5792) ack 1 win 457 <nop,nop,timestamp 65455685 137697942>
> 10:51:23.395533 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 4345 win 295 <nop,nop,timestamp 137697942 65455635>
> 10:51:23.395631 IP 10.246.17.83.49096 > 10.246.17.84.43872: . 20273:23169(2896) ack 1 win 457 <nop,nop,timestamp 65455685 137697942>
> 10:51:23.395746 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 5793 win 317 <nop,nop,timestamp 137697942 65455635>
> 10:51:23.395854 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 7241 win 340 <nop,nop,timestamp 137697943 65455635>
> 10:51:23.396049 IP 10.246.17.84.43872 > 10.246.17.83.49096: . ack 8689 win 340 <nop,nop,timestamp 137697943 65455635>
> 10:51:23.396199 IP 10.246.17.83.49096 > 10.246.17.84.43872: P 23169:31857(8688) ack 1 win 457 <nop,nop,timestamp 65455685 137697943>
>
> echo 7 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:51:58.219334 IP 10.246.17.83.58882 > 10.246.17.84.41983: S 3763353310:3763353310(0) win 29200 <mss 1460,sackOK,timestamp 65490509 0,nop,wscale 6>
> 10:51:58.269455 IP 10.246.17.84.41983 > 10.246.17.83.58882: S 1445588492:1445588492(0) ack 3763353311 win 28960 <mss 1460,sackOK,timestamp 137732816 65490509,nop,wscale 7>
> 10:51:58.269536 IP 10.246.17.83.58882 > 10.246.17.84.41983: . ack 1 win 457 <nop,nop,timestamp 65490559 137732816>
> 10:51:58.269634 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 1:10137(10136) ack 1 win 457 <nop,nop,timestamp 65490559 137732816>
> 10:51:58.269646 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 10137:14481(4344) ack 1 win 457 <nop,nop,timestamp 65490559 137732816>
> 10:51:58.319765 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 1449 win 249 <nop,nop,timestamp 137732866 65490559>
> 10:51:58.319846 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65490609 137732866>
> 10:51:58.319767 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 2897 win 272 <nop,nop,timestamp 137732866 65490559>
> 10:51:58.319843 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 4345 win 295 <nop,nop,timestamp 137732867 65490559>
> 10:51:58.319911 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 17377:23169(5792) ack 1 win 457 <nop,nop,timestamp 65490609 137732867>
> 10:51:58.320068 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 5793 win 317 <nop,nop,timestamp 137732867 65490559>
> 10:51:58.320180 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 7241 win 340 <nop,nop,timestamp 137732867 65490559>
> 10:51:58.320287 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 8689 win 340 <nop,nop,timestamp 137732867 65490559>
> 10:51:58.320295 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 23169:31857(8688) ack 1 win 457 <nop,nop,timestamp 65490610 137732867>
> 10:51:58.320496 IP 10.246.17.84.41983 > 10.246.17.83.58882: . ack 10137 win 340 <nop,nop,timestamp 137732867 65490559>
> 10:51:58.320513 IP 10.246.17.83.58882 > 10.246.17.84.41983: . 31857:33305(1448) ack 1 win 457 <nop,nop,timestamp 65490610 137732867>
>
> echo 8 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:52:50.398941 IP 10.246.17.83.32908 > 10.246.17.84.65099: S 678482142:678482142(0) win 29200 <mss 1460,sackOK,timestamp 65542688 0,nop,wscale 6>
> 10:52:50.449061 IP 10.246.17.84.65099 > 10.246.17.83.32908: S 3229813359:3229813359(0) ack 678482143 win 28960 <mss 1460,sackOK,timestamp 137784996 65542688,nop,wscale 7>
> 10:52:50.449146 IP 10.246.17.83.32908 > 10.246.17.84.65099: . ack 1 win 457 <nop,nop,timestamp 65542738 137784996>
> 10:52:50.449258 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 1:11585(11584) ack 1 win 457 <nop,nop,timestamp 65542739 137784996>
> 10:52:50.449384 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 11585:14481(2896) ack 1 win 457 <nop,nop,timestamp 65542739 137784996>
> 10:52:50.499379 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 1449 win 249 <nop,nop,timestamp 137785046 65542739>
> 10:52:50.499462 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65542789 137785046>
> 10:52:50.499381 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 2897 win 272 <nop,nop,timestamp 137785046 65542739>
> 10:52:50.499552 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 17377:20273(2896) ack 1 win 457 <nop,nop,timestamp 65542789 137785046>
> 10:52:50.499552 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 4345 win 295 <nop,nop,timestamp 137785046 65542739>
> 10:52:50.499661 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 5793 win 317 <nop,nop,timestamp 137785046 65542739>
> 10:52:50.499806 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 7241 win 340 <nop,nop,timestamp 137785046 65542739>
> 10:52:50.499845 IP 10.246.17.83.32908 > 10.246.17.84.65099: . 20273:28961(8688) ack 1 win 457 <nop,nop,timestamp 65542789 137785046>
> 10:52:50.500006 IP 10.246.17.84.65099 > 10.246.17.83.32908: . ack 8689 win 340 <nop,nop,timestamp 137785047 65542739>
>
> echo 9 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:53:31.504788 IP 10.246.17.83.59687 > 10.246.17.84.38716: S 1238515537:1238515537(0) win 29200 <mss 1460,sackOK,timestamp 65583794 0,nop,wscale 6>
> 10:53:31.554898 IP 10.246.17.84.38716 > 10.246.17.83.59687: S 667062900:667062900(0) ack 1238515538 win 28960 <mss 1460,sackOK,timestamp 137826102 65583794,nop,wscale 7>
> 10:53:31.554973 IP 10.246.17.83.59687 > 10.246.17.84.38716: . ack 1 win 457 <nop,nop,timestamp 65583844 137826102>
> 10:53:31.555050 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 1:13033(13032) ack 1 win 457 <nop,nop,timestamp 65583844 137826102>
> 10:53:31.555072 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 13033:14481(1448) ack 1 win 457 <nop,nop,timestamp 65583844 137826102>
> 10:53:31.605154 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 1449 win 249 <nop,nop,timestamp 137826152 65583844>
> 10:53:31.605235 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 14481:17377(2896) ack 1 win 457 <nop,nop,timestamp 65583895 137826152>
> 10:53:31.605156 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 2897 win 272 <nop,nop,timestamp 137826152 65583844>
> 10:53:31.605293 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 4345 win 295 <nop,nop,timestamp 137826152 65583844>
> 10:53:31.605325 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 17377:23169(5792) ack 1 win 457 <nop,nop,timestamp 65583895 137826152>
> 10:53:31.605461 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 5793 win 317 <nop,nop,timestamp 137826152 65583844>
> 10:53:31.605599 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 7241 win 340 <nop,nop,timestamp 137826152 65583844>
> 10:53:31.605750 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 8689 win 340 <nop,nop,timestamp 137826152 65583844>
> 10:53:31.605834 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 23169:31857(8688) ack 1 win 457 <nop,nop,timestamp 65583895 137826152>
> 10:53:31.605899 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 10137 win 340 <nop,nop,timestamp 137826153 65583844>
> 10:53:31.606055 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 11585 win 340 <nop,nop,timestamp 137826153 65583844>
> 10:53:31.606155 IP 10.246.17.83.59687 > 10.246.17.84.38716: . 31857:36201(4344) ack 1 win 457 <nop,nop,timestamp 65583895 137826153>
> 10:53:31.606157 IP 10.246.17.84.38716 > 10.246.17.83.59687: . ack 13033 win 340 <nop,nop,timestamp 137826153 65583844>
>
> echo 10 >/proc/sys/net/ipv4/tcp_min_tso_segs
>
> 10:54:15.974831 IP 10.246.17.83.53733 > 10.246.17.84.34163: S 690526362:690526362(0) win 29200 <mss 1460,sackOK,timestamp 65628264 0,nop,wscale 6>
> 10:54:16.024978 IP 10.246.17.84.34163 > 10.246.17.83.53733: S 1914393851:1914393851(0) ack 690526363 win 28960 <mss 1460,sackOK,timestamp 137870572 65628264,nop,wscale 7>
> 10:54:16.025047 IP 10.246.17.83.53733 > 10.246.17.84.34163: . ack 1 win 457 <nop,nop,timestamp 65628314 137870572>
> 10:54:16.025132 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 1:14481(14480) ack 1 win 457 <nop,nop,timestamp 65628314 137870572>
> 10:54:16.075247 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 1449 win 249 <nop,nop,timestamp 137870622 65628314>
> 10:54:16.075249 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 2897 win 272 <nop,nop,timestamp 137870622 65628314>
> 10:54:16.075334 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 14481:20273(5792) ack 1 win 457 <nop,nop,timestamp 65628365 137870622>
> 10:54:16.075452 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 4345 win 295 <nop,nop,timestamp 137870622 65628314>
> 10:54:16.075570 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 5793 win 317 <nop,nop,timestamp 137870622 65628314>
> 10:54:16.075674 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 7241 win 340 <nop,nop,timestamp 137870622 65628314>
> 10:54:16.075698 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 20273:28961(8688) ack 1 win 457 <nop,nop,timestamp 65628365 137870622>
> 10:54:16.075833 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 8689 win 340 <nop,nop,timestamp 137870622 65628314>
> 10:54:16.075990 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 10137 win 340 <nop,nop,timestamp 137870623 65628314>
> 10:54:16.076116 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 28961:34753(5792) ack 1 win 457 <nop,nop,timestamp 65628365 137870623>
> 10:54:16.076096 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 11585 win 340 <nop,nop,timestamp 137870623 65628314>
> 10:54:16.076291 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 13033 win 340 <nop,nop,timestamp 137870623 65628314>
> 10:54:16.076435 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 14481 win 340 <nop,nop,timestamp 137870623 65628314>
> 10:54:16.125492 IP 10.246.17.84.34163 > 10.246.17.83.53733: . ack 15929 win 340 <nop,nop,timestamp 137870672 65628365>
> 10:54:16.125569 IP 10.246.17.83.53733 > 10.246.17.84.34163: . 34753:46337(11584) ack 1 win 457 <nop,nop,timestamp 65628415 137870672>
>
>
>


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-25 22:01     ` Yuchung Cheng
@ 2013-08-26  0:37       ` Eric Dumazet
  2013-08-26  2:22         ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-08-26  0:37 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Neal Cardwell, David Miller, netdev, Van Jacobson, Tom Herbert

On Sun, 2013-08-25 at 15:01 -0700, Yuchung Cheng wrote:

> Any idea to get rid of this undesirable extra RTT delay?

It's probably a bug in the push code.


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-26  0:37       ` Eric Dumazet
@ 2013-08-26  2:22         ` Eric Dumazet
  2013-08-26  3:58           ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-08-26  2:22 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Neal Cardwell, David Miller, netdev, Van Jacobson, Tom Herbert

On Sun, 2013-08-25 at 17:37 -0700, Eric Dumazet wrote:
> On Sun, 2013-08-25 at 15:01 -0700, Yuchung Cheng wrote:
> 
> > Any idea to get rid of this undesirable extra RTT delay?
> 
> It's probably a bug in the push code.

For an exact write/send of a multiple of MSS, I think the following patch
should fix the bug.

If we filled a packet, we must send it.

For the other problem, I think it's related to Nagle.

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ab64eea..dd326f4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1187,7 +1187,8 @@ new_segment:
 
 			from += copy;
 			copied += copy;
-			if ((seglen -= copy) == 0 && iovlen == 0)
+			seglen -= copy;
+			if (seglen == 0 && iovlen == 0 && skb->len < max)
 				goto out;
 
 			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-26  2:22         ` Eric Dumazet
@ 2013-08-26  3:58           ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-26  3:58 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Neal Cardwell, David Miller, netdev, Van Jacobson, Tom Herbert

On Sun, 2013-08-25 at 19:22 -0700, Eric Dumazet wrote:
> On Sun, 2013-08-25 at 17:37 -0700, Eric Dumazet wrote:
> > On Sun, 2013-08-25 at 15:01 -0700, Yuchung Cheng wrote:
> > 
> > > Any idea to get rid of this undesirable extra RTT delay?
> > 
> > It's probably a bug in the push code.
> 
> For an exact write/send of a multiple of MSS, I think the following patch
> should fix the bug.
> 
> If we filled a packet, we must send it.
> 
> For the other problem, I think it's related to Nagle.

Oh well, that's tcp_tso_should_defer() again.

        /* If a full-sized TSO skb can be sent, do it. */
        if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
                           sk->sk_gso_max_segs * tp->mss_cache))
                goto send_now;

I'll send a V2 of the patch
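
To illustrate the check quoted above, here is a minimal userspace sketch (not kernel code). The helper name full_sized() and the sample values are assumptions for illustration: an MSS of 1448 bytes as in the traces earlier in the thread, and 65536 / 65535 as the usual defaults for sk_gso_max_size / sk_gso_max_segs. The limit argument stands for the number of bytes the windows currently allow.

```c
#include <stdbool.h>
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Model of the "full-sized TSO skb" test in tcp_tso_should_defer().
 * limit:        bytes currently allowed by the windows (assumed given)
 * gso_max_size: stands in for sk->sk_gso_max_size
 * segs:         segment count used to build the threshold
 * mss:          stands in for tp->mss_cache
 */
static bool full_sized(uint32_t limit, uint32_t gso_max_size,
		       uint32_t segs, uint32_t mss)
{
	return limit >= min_u32(gso_max_size, segs * mss);
}
```

With these numbers, full_sized(2896, 65536, 65535, 1448) is false: a remaining window of 2 MSS is far below the 65536-byte threshold, so the send_now shortcut is not taken and the skb is left to the deferral heuristics. Passing a small segment goal instead, as in full_sized(2896, 65536, 2, 1448), makes the same 2896 bytes count as full-sized; the V2 posted later in this thread does this with tp->xmit_size_goal_segs.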


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-24  0:29 [PATCH net-next] tcp: TSO packets automatic sizing Eric Dumazet
  2013-08-24  3:17 ` Neal Cardwell
@ 2013-08-25  2:46 ` David Miller
  2013-08-25  2:52   ` Eric Dumazet
  2013-08-26  4:26 ` [PATCH v2 " Eric Dumazet
  2 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2013-08-25  2:46 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng, vanj, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 23 Aug 2013 17:29:52 -0700

> After hearing many people over the past years complaining about TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.

Looks great.

> +	pr_debug("cwnd %u packets_out %u srtt %u -> rate = %llu bits\n",
> +		 tp->snd_cwnd, tp->packets_out,
> +		 jiffies_to_usecs(tp->srtt) >> 3, rate << 3);

I'd suggest you remove this though.


* Re: [PATCH net-next] tcp: TSO packets automatic sizing
  2013-08-25  2:46 ` David Miller
@ 2013-08-25  2:52   ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-25  2:52 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, ncardwell, ycheng, vanj, therbert

On Sat, 2013-08-24 at 22:46 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 23 Aug 2013 17:29:52 -0700
> 
> > After hearing many people over the past years complaining about TSO being
> > bursty or even buggy, we are proud to present automatic sizing of TSO
> > packets.
> 
> Looks great.
> 
> > +	pr_debug("cwnd %u packets_out %u srtt %u -> rate = %llu bits\n",
> > +		 tp->snd_cwnd, tp->packets_out,
> > +		 jiffies_to_usecs(tp->srtt) >> 3, rate << 3);
> 
> I'd suggest you remove this though.
> 

Sure, but I found this very useful while debugging sch_fq.

CTRL=/sys/kernel/debug/dynamic_debug/control
echo "func tcp_rtt_estimator +p" >$CTRL
./netperf -l -1000000 -H lpq84
echo "func tcp_rtt_estimator -p" >$CTRL 


* [PATCH v2 net-next] tcp: TSO packets automatic sizing
  2013-08-24  0:29 [PATCH net-next] tcp: TSO packets automatic sizing Eric Dumazet
  2013-08-24  3:17 ` Neal Cardwell
  2013-08-25  2:46 ` David Miller
@ 2013-08-26  4:26 ` Eric Dumazet
  2013-08-26 19:09   ` Yuchung Cheng
                     ` (2 more replies)
  2 siblings, 3 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-26  4:26 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Van Jacobson, Tom Herbert

From: Eric Dumazet <edumazet@google.com>

After hearing many people over the past years complaining about TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.

This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt
 
v2: Neal Cardwell reported a suspicious deferral of the last two segments
on an initial write of 10 MSS. I had to change tcp_tso_should_defer() to
take tp->xmit_size_goal_segs into account.
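
As a worked illustration of the sizing rule described above, here is a small userspace sketch (not the kernel code). The helper names pacing_rate() and tso_goal() and the sample numbers are assumptions; the header-length and TSQ (tcp_limit_output_bytes) clamps applied by the patch are left out, and cap_bytes simply stands in for the GSO size limit.

```c
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, in bytes per second,
 * clamped to 32 bits as in the patch. srtt_us must be non-zero.
 */
static uint32_t pacing_rate(uint32_t cwnd, uint32_t mss, uint32_t srtt_us)
{
	uint64_t rate = 2ULL * cwnd * mss * 1000000ULL / srtt_us;

	return rate > UINT32_MAX ? UINT32_MAX : (uint32_t)rate;
}

/* Bytes sent in about 1 ms at the current rate (half the pacing rate),
 * at least min_tso_segs segments, at most cap_bytes.
 */
static uint32_t tso_goal(uint32_t rate, uint32_t mss,
			 uint32_t min_tso_segs, uint32_t cap_bytes)
{
	uint32_t goal = rate / (2 * 1000);
	uint32_t floor_bytes = min_tso_segs * mss;

	if (goal < floor_bytes)
		goal = floor_bytes;
	if (goal > cap_bytes)
		goal = cap_bytes;
	return goal;
}
```

For cwnd = 10, mss = 1448 and srtt = 50 ms the rate is 579200 bytes/s, and the 1 ms budget (289 bytes) is raised to the 2-segment floor of 2896 bytes. For cwnd = 100 and srtt = 10 ms the rate is 28960000 bytes/s and the goal is 14480 bytes (10 segments). Very fast flows hit the cap and keep large TSO packets.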

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
---
 Documentation/networking/ip-sysctl.txt |    9 ++++++
 include/net/sock.h                     |    2 +
 include/net/tcp.h                      |    1 
 net/ipv4/sysctl_net_ipv4.c             |   10 +++++++
 net/ipv4/tcp.c                         |   28 +++++++++++++++++---
 net/ipv4/tcp_input.c                   |   31 ++++++++++++++++++++++-
 net/ipv4/tcp_output.c                  |    2 -
 7 files changed, 76 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index debfe85..ce5bb43 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -482,6 +482,15 @@ tcp_syn_retries - INTEGER
 tcp_timestamps - BOOLEAN
 	Enable timestamps as defined in RFC1323.
 
+tcp_min_tso_segs - INTEGER
+	Minimal number of segments per TSO frame.
+	Since linux-3.12, TCP does an automatic sizing of TSO frames,
+	depending on flow rate, instead of filling 64Kbytes packets.
+	For specific usages, it's possible to force TCP to build big
+	TSO frames. Note that TCP stack might split too big TSO packets
+	if available window is too small.
+	Default: 2
+
 tcp_tso_win_divisor - INTEGER
 	This allows control over what percentage of the congestion window
 	can be consumed by a single TSO frame.
diff --git a/include/net/sock.h b/include/net/sock.h
index e4bbcbf..6ba2e7b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -232,6 +232,7 @@ struct cg_proto;
   *	@sk_napi_id: id of the last napi context to receive data for sk
   *	@sk_ll_usec: usecs to busypoll when there is no data
   *	@sk_allocation: allocation mode
+  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
   *	@sk_sndbuf: size of send buffer in bytes
   *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -361,6 +362,7 @@ struct sock {
 	kmemcheck_bitfield_end(flags);
 	int			sk_wmem_queued;
 	gfp_t			sk_allocation;
+	u32			sk_pacing_rate; /* bytes per second */
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
 	int			sk_gso_type;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 09cb5c1..73fcd7c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -281,6 +281,7 @@ extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
 extern unsigned int sysctl_tcp_notsent_lowat;
+extern int sysctl_tcp_min_tso_segs;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 8ed7c32..540279f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -29,6 +29,7 @@
 static int zero;
 static int one = 1;
 static int four = 4;
+static int gso_max_segs = GSO_MAX_SEGS;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -761,6 +762,15 @@ static struct ctl_table ipv4_table[] = {
 		.extra2		= &four,
 	},
 	{
+		.procname	= "tcp_min_tso_segs",
+		.data		= &sysctl_tcp_min_tso_segs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &gso_max_segs,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index ab64eea..e1714ee 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -283,6 +283,8 @@
 
 int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 
+int sysctl_tcp_min_tso_segs __read_mostly = 2;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -785,12 +787,28 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	xmit_size_goal = mss_now;
 
 	if (large_allowed && sk_can_gso(sk)) {
-		xmit_size_goal = ((sk->sk_gso_max_size - 1) -
-				  inet_csk(sk)->icsk_af_ops->net_header_len -
-				  inet_csk(sk)->icsk_ext_hdr_len -
-				  tp->tcp_header_len);
+		u32 gso_size, hlen;
+
+		/* Maybe we should/could use sk->sk_prot->max_header here ? */
+		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+		       inet_csk(sk)->icsk_ext_hdr_len +
+		       tp->tcp_header_len;
+
+		/* Goal is to send at least one packet per ms,
+		 * not one big TSO packet every 100 ms.
+		 * This preserves ACK clocking and is consistent
+		 * with tcp_tso_should_defer() heuristic.
+		 */
+		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
+		gso_size = max_t(u32, gso_size,
+				 sysctl_tcp_min_tso_segs * mss_now);
+
+		xmit_size_goal = min_t(u32, gso_size,
+				       sk->sk_gso_max_size - 1 - hlen);
 
-		/* TSQ : try to have two TSO segments in flight */
+		/* TSQ : try to have at least two segments in flight
+		 * (one in NIC TX ring, another in Qdisc)
+		 */
 		xmit_size_goal = min_t(u32, xmit_size_goal,
 				       sysctl_tcp_limit_output_bytes >> 1);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec492ea..3d63db7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -688,6 +688,33 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
 	}
 }
 
+/* Set sk_pacing_rate to allow proper sizing of TSO packets.
+ * Note: TCP stack does not yet implement pacing.
+ * FQ packet scheduler can be used to implement cheap but effective
+ * TCP pacing, to smooth the burst on large writes when packets
+ * in flight is significantly lower than cwnd (or rwin)
+ */
+static void tcp_update_pacing_rate(struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u64 rate;
+
+	/* set sk_pacing_rate to 200 % of current rate (mss * cwnd / rtt) */
+	rate = (u64)tp->mss_cache * 8 * 2 * USEC_PER_SEC;
+
+	rate *= max(tp->snd_cwnd, tp->packets_out);
+
+	do_div(rate, jiffies_to_usecs(tp->srtt));
+
+	/* Correction for small srtt : minimum srtt being 8 (1 jiffy),
+	 * be conservative and assume rtt = 1/(8*HZ) instead of 1/HZ s
+	 * We probably need usec resolution in the future.
+	 */
+	if (tp->srtt <= 8 + 2)
+		rate <<= 3;
+	sk->sk_pacing_rate = min_t(u64, rate, ~0U);
+}
+
 /* Calculate rto without backoff.  This is the second half of Van Jacobson's
  * routine referred to above.
  */
@@ -3278,7 +3305,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	u32 ack_seq = TCP_SKB_CB(skb)->seq;
 	u32 ack = TCP_SKB_CB(skb)->ack_seq;
 	bool is_dupack = false;
-	u32 prior_in_flight;
+	u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
 	u32 prior_fackets;
 	int prior_packets = tp->packets_out;
 	const int prior_unsacked = tp->packets_out - tp->sacked_out;
@@ -3383,6 +3410,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
 		tcp_schedule_loss_probe(sk);
+	if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
+		tcp_update_pacing_rate(sk);
 	return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 884efff..e63ae4c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 
 	/* If a full-sized TSO skb can be sent, do it. */
 	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   sk->sk_gso_max_segs * tp->mss_cache))
+			   tp->xmit_size_goal_segs * tp->mss_cache))
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
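
The fixed-point arithmetic in tcp_update_pacing_rate() above can be checked with a small userspace model (a sketch, not the kernel function). It assumes HZ = 1000 so that one jiffy is 1000 usec, models jiffies_to_usecs() as a plain multiplication, and uses the hypothetical name model_pacing_rate(). tp->srtt holds 8 times the smoothed RTT in jiffies, which is why the numerator carries a factor of 8.

```c
#include <stdint.h>

#define HZ_ASSUMED	1000u		/* assumption: 1 jiffy = 1 ms */
#define USEC_PER_SEC	1000000ull

/* srtt8 is the smoothed RTT in jiffies scaled by 8, like tp->srtt.
 * The caller must pass srtt8 > 0. Returns bytes per second.
 */
static uint32_t model_pacing_rate(uint32_t mss, uint32_t cwnd,
				  uint32_t packets_out, uint32_t srtt8)
{
	/* 2 * mss, times 8 to cancel the scaling of srtt8 */
	uint64_t rate = (uint64_t)mss * 8 * 2 * USEC_PER_SEC;
	/* stand-in for jiffies_to_usecs(tp->srtt) with HZ = 1000 */
	uint64_t srtt8_us = (uint64_t)srtt8 * (USEC_PER_SEC / HZ_ASSUMED);

	rate *= cwnd > packets_out ? cwnd : packets_out;
	rate /= srtt8_us;

	/* small-srtt correction from the patch */
	if (srtt8 <= 8 + 2)
		rate <<= 3;

	return rate > UINT32_MAX ? UINT32_MAX : (uint32_t)rate;
}
```

With mss = 1448, cwnd = 10 and srtt8 = 400 (50 ms) this gives 579200 bytes/s, which matches 2 * cwnd * mss / srtt. With srtt8 = 8 (one jiffy) the small-srtt correction multiplies the result by 8, and large windows then saturate at the 32-bit limit.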


* Re: [PATCH v2 net-next] tcp: TSO packets automatic sizing
  2013-08-26  4:26 ` [PATCH v2 " Eric Dumazet
@ 2013-08-26 19:09   ` Yuchung Cheng
  2013-08-26 20:28     ` Eric Dumazet
  2013-08-27  0:47   ` Eric Dumazet
  2013-08-27 12:46   ` [PATCH v3 " Eric Dumazet
  2 siblings, 1 reply; 25+ messages in thread
From: Yuchung Cheng @ 2013-08-26 19:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Neal Cardwell, Van Jacobson, Tom Herbert

On Sun, Aug 25, 2013 at 9:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> After hearing many people over the past years complaining about TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.
>
> One part of the problem is that tcp_tso_should_defer() uses a heuristic
> relying on upcoming ACKs instead of a timer, but more generally, having
> big TSO packets makes little sense for low rates, as it tends to create
> micro bursts on the network, and general consensus is to reduce the
> buffering amount.
>
> This patch introduces a per socket sk_pacing_rate, that approximates
> the current sending rate, and allows us to size the TSO packets so
> that we try to send one packet every ms.
>
> This field could be set by other transports.
>
> Patch has no impact for high speed flows, where having large TSO packets
> makes sense to reach line rate.
>
> For other flows, this helps better packet scheduling and ACK clocking.
>
> This patch increases performance of TCP flows in lossy environments.
>
> A new sysctl (tcp_min_tso_segs) is added, to specify the
> minimal size of a TSO packet (default being 2).
>
> A follow-up patch will provide a new packet scheduler (FQ), using
> sk_pacing_rate as an input to perform optional per flow pacing.
>
> This explains why we chose to set sk_pacing_rate to twice the current
> rate, allowing 'slow start' ramp up.
>
> sk_pacing_rate = 2 * cwnd * mss / srtt
>
> v2: Neal Cardwell reported a suspicious deferral of the last two segments
> on an initial write of 10 MSS. I had to change tcp_tso_should_defer() to
> take tp->xmit_size_goal_segs into account.

An initial write of 10 MSS now looks good, but the delay still shows up if
9 MSS < write size < 10 MSS.

I use packetdrill to do a write() of 13141 bytes with an MSS of 1460:

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 299245565:299245565(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
0.100 > S. 0:0(0) ack 299245566 <mss 1460,nop,nop,sackOK,nop,wscale 6>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
+.000 write(4, ..., 13141) = 13141
+3 %{ print "done" }%

and tcpdump shows

31.819958 IP cli > srv: S 299245565:299245565(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
000034 IP srv > cli: S 203810874:203810874(0) ack 299245566 win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 6>
099966 IP cli > srv: . ack 1 win 257
000079 IP srv > cli: . 1:2921(2920) ack 1 win 457
000010 IP srv > cli: . 2921:5841(2920) ack 1 win 457
000005 IP srv > cli: . 5841:8761(2920) ack 1 win 457
000004 IP srv > cli: . 8761:11681(2920) ack 1 win 457
199659 IP srv > cli: . 11681:13141(1460) ack 1 win 457
...




* Re: [PATCH v2 net-next] tcp: TSO packets automatic sizing
  2013-08-26 19:09   ` Yuchung Cheng
@ 2013-08-26 20:28     ` Eric Dumazet
  2013-08-26 22:31       ` Yuchung Cheng
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-08-26 20:28 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: David Miller, netdev, Neal Cardwell, Van Jacobson, Tom Herbert

On Mon, 2013-08-26 at 12:09 -0700, Yuchung Cheng wrote:

> An initial write of 10 MSS now looks good, but the delay still shows up if
> 9 MSS < write size < 10 MSS.
> 
> I use packetdrill to do a write() of 13141 bytes with an MSS of 1460:
> 
> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 0.000 bind(3, ..., ...) = 0
> 0.000 listen(3, 1) = 0
> 
> 0.100 < S 299245565:299245565(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 0.100 > S. 0:0(0) ack 299245566 <mss 1460,nop,nop,sackOK,nop,wscale 6>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> +.000 write(4, ..., 13141) = 13141
> +3 %{ print "done" }%
> 
> and tcpdump shows
> 
> 31.819958 IP cli > srv: S 299245565:299245565(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 000034 IP srv > cli: S 203810874:203810874(0) ack 299245566 win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 6>
> 099966 IP cli > srv: . ack 1 win 257
> 000079 IP srv > cli: . 1:2921(2920) ack 1 win 457
> 000010 IP srv > cli: . 2921:5841(2920) ack 1 win 457
> 000005 IP srv > cli: . 5841:8761(2920) ack 1 win 457
> 000004 IP srv > cli: . 8761:11681(2920) ack 1 win 457
> 199659 IP srv > cli: . 11681:13141(1460) ack 1 win 457
> ...

It works correctly here; I wonder what's happening on your host.

lpq83:~# tcpdump -p -n -s 0 -i any port 8080 -ttt
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
000000 IP 192.0.2.1.42148 > 192.168.0.1.8080: S 299245565:299245565(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
000042 IP 192.168.0.1.8080 > 192.0.2.1.42148: S 223873324:223873324(0) ack 299245566 win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 6>
099946 IP 192.0.2.1.42148 > 192.168.0.1.8080: . ack 1 win 257
000087 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 1:2921(2920) ack 1 win 457
000006 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 2921:5841(2920) ack 1 win 457
000004 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 5841:8761(2920) ack 1 win 457
000004 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 8761:11681(2920) ack 1 win 457
000005 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 11681:13141(1460) ack 1 win 457
000002 IP 192.168.0.1.8080 > 192.0.2.1.42148: P 13141:13142(1) ack 1 win 457


* Re: [PATCH v2 net-next] tcp: TSO packets automatic sizing
  2013-08-26 20:28     ` Eric Dumazet
@ 2013-08-26 22:31       ` Yuchung Cheng
  0 siblings, 0 replies; 25+ messages in thread
From: Yuchung Cheng @ 2013-08-26 22:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Neal Cardwell, Van Jacobson, Tom Herbert

On Mon, Aug 26, 2013 at 1:28 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2013-08-26 at 12:09 -0700, Yuchung Cheng wrote:
>
>> init write of 10MSS now looks good, but the delay still shows up if 9
>> < write-size < 10 MSS...
>>
>> I use packetdrill to do write(14599) bytes of MSS 1460
>>
>> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>> 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
>> 0.000 bind(3, ..., ...) = 0
>> 0.000 listen(3, 1) = 0
>>
>> 0.100 < S 299245565:299245565(0) win 32792 <mss
>> 1460,sackOK,nop,nop,nop,wscale 7>
>> 0.100 > S. 0:0(0) ack 299245566 <mss 1460,nop,nop,sackOK,nop,wscale 6>
>> 0.200 < . 1:1(0) ack 1 win 257
>> 0.200 accept(3, ..., ...) = 4
>> +.000 write(4, ..., 13141) = 13141
>> +3 %{ print "done" }%
>>
>> and tcpdump shows
>>
>> 31.819958 IP cli > srv: S 299245565:299245565(0) win 32792 <mss 1460,sackOK,nop,
>> nop,nop,wscale 7>
>> 000034 IP srv > cli: S 203810874:203810874(0) ack 299245566 win 29200
>> <mss 1460,nop,nop,sackOK,nop,wscale 6>
>> 099966 IP cli > srv: . ack 1 win 257
>> 000079 IP srv > cli: . 1:2921(2920) ack 1 win 457
>> 000010 IP srv > cli: . 2921:5841(2920) ack 1 win 457
>> 000005 IP srv > cli: . 5841:8761(2920) ack 1 win 457
>> 000004 IP srv > cli: . 8761:11681(2920) ack 1 win 457
>> 199659 IP srv > cli: . 11681:13141(1460) ack 1 win 457
>> ...
>
> It works correctly here, I wonder what's happening on your host.
Sorry! I was testing the wrong patch, so please ignore my bogus report. I have
to run but will take another look at the new patch later today or
tomorrow.

>
> lpq83:~# tcpdump -p -n -s 0 -i any port 8080 -ttt
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
> 000000 IP 192.0.2.1.42148 > 192.168.0.1.8080: S 299245565:299245565(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7>
> 000042 IP 192.168.0.1.8080 > 192.0.2.1.42148: S 223873324:223873324(0) ack 299245566 win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 6>
> 099946 IP 192.0.2.1.42148 > 192.168.0.1.8080: . ack 1 win 257
> 000087 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 1:2921(2920) ack 1 win 457
> 000006 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 2921:5841(2920) ack 1 win 457
> 000004 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 5841:8761(2920) ack 1 win 457
> 000004 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 8761:11681(2920) ack 1 win 457
> 000005 IP 192.168.0.1.8080 > 192.0.2.1.42148: . 11681:13141(1460) ack 1 win 457
> 000002 IP 192.168.0.1.8080 > 192.0.2.1.42148: P 13141:13142(1) ack 1 win 457
>
>


* Re: [PATCH v2 net-next] tcp: TSO packets automatic sizing
  2013-08-26  4:26 ` [PATCH v2 " Eric Dumazet
  2013-08-26 19:09   ` Yuchung Cheng
@ 2013-08-27  0:47   ` Eric Dumazet
  2013-08-27 12:46   ` [PATCH v3 " Eric Dumazet
  2 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-27  0:47 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Van Jacobson, Tom Herbert

On Sun, 2013-08-25 at 21:26 -0700, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>

>  
> v2: Neal Cardwell reported a suspect deferring of last two segments on
> initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
> into account tp->xmit_size_goal_segs 

Please do not apply; I will send a v3, because srtt can be zero on some
occasions (divide by 0).


* [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-26  4:26 ` [PATCH v2 " Eric Dumazet
  2013-08-26 19:09   ` Yuchung Cheng
  2013-08-27  0:47   ` Eric Dumazet
@ 2013-08-27 12:46   ` Eric Dumazet
  2013-08-28  0:17     ` Yuchung Cheng
                       ` (3 more replies)
  2 siblings, 4 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-27 12:46 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Yuchung Cheng, Van Jacobson, Tom Herbert

From: Eric Dumazet <edumazet@google.com>

After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses a heuristic
relying on upcoming ACKs instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.

This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt
 
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs 

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
---
v3: The change Yuchung suggested added a possibility of a divide by 0:
    In some (retransmit) cases, srtt can be 0 because
    tcp_rtt_estimator() has not yet been called.
    Change the computation to remove this, and do not yet use usec
    as the units, but HZ. [ It's interesting to see jiffies_to_usecs()
    being an out of line function :( ]

This version passed all our tests.
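
For readers following the arithmetic, here is a minimal userspace sketch of
the v3 rate computation and the TSO size goal derived from it. It mirrors the
integer math of tcp_update_pacing_rate() and tcp_xmit_size_goal() in the diff
below but is not kernel code: the function names pacing_rate() and
tso_size_goal(), the explicit hz parameter, and the parameter names are
assumptions made for illustration. Rounding the goal down to a whole number of
segments, which the surrounding kernel function does afterwards, is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of tcp_update_pacing_rate().
 * srtt_scaled is the smoothed RTT in (jiffies << 3), as kept in tp->srtt.
 * Returns bytes per second, saturated to 32 bits.
 */
static uint32_t pacing_rate(uint32_t mss, uint32_t cwnd, uint32_t packets_out,
			    uint32_t srtt_scaled, uint32_t hz)
{
	uint32_t pkts = cwnd > packets_out ? cwnd : packets_out;
	uint64_t rate;

	assert(hz > 0);

	/* 200 % of mss * cwnd / srtt; (hz << 3) undoes the srtt scaling */
	rate = (uint64_t)mss * 2 * ((uint64_t)hz << 3);
	rate *= pkts;

	/* No division for srtt_scaled <= 10: treats a tiny srtt as 1 and
	 * also covers srtt == 0, so there is no divide by zero.
	 */
	if (srtt_scaled > 8 + 2)
		rate /= srtt_scaled;

	return rate > UINT32_MAX ? UINT32_MAX : (uint32_t)rate;
}

/* Hypothetical mirror of the sizing step in tcp_xmit_size_goal().
 * max_payload stands for sk_gso_max_size - 1 - hlen.
 */
static uint32_t tso_size_goal(uint32_t rate, uint32_t mss,
			      uint32_t min_tso_segs, uint32_t max_payload,
			      uint32_t limit_output_bytes)
{
	/* rate / 2000: bytes per millisecond at half the pacing rate */
	uint32_t goal = rate / (2 * 1000);
	uint32_t floor_bytes = min_tso_segs * mss;

	if (goal < floor_bytes)
		goal = floor_bytes;		/* tcp_min_tso_segs floor */
	if (goal > max_payload)
		goal = max_payload;		/* GSO size ceiling */
	if (goal > (limit_output_bytes >> 1))
		goal = limit_output_bytes >> 1;	/* TSQ clamp */
	return goal;
}
```

With mss 1460, cwnd 10 and a 100 ms srtt at HZ=1000 (srtt_scaled = 800),
pacing_rate() gives 292000 bytes per second. The per-millisecond goal is then
only 146 bytes, so the two-segment floor of 2920 bytes applies, which is the
packet size seen in the traces earlier in the thread.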

 Documentation/networking/ip-sysctl.txt |    9 ++++++
 include/net/sock.h                     |    2 +
 include/net/tcp.h                      |    1 
 net/ipv4/sysctl_net_ipv4.c             |   10 +++++++
 net/ipv4/tcp.c                         |   28 ++++++++++++++++----
 net/ipv4/tcp_input.c                   |   32 ++++++++++++++++++++++-
 net/ipv4/tcp_output.c                  |    2 -
 7 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index debfe85..ce5bb43 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -482,6 +482,15 @@ tcp_syn_retries - INTEGER
 tcp_timestamps - BOOLEAN
 	Enable timestamps as defined in RFC1323.
 
+tcp_min_tso_segs - INTEGER
+	Minimal number of segments per TSO frame.
+	Since linux-3.12, TCP does an automatic sizing of TSO frames,
+	depending on flow rate, instead of filling 64Kbytes packets.
+	For specific usages, it's possible to force TCP to build big
+	TSO frames. Note that TCP stack might split too big TSO packets
+	if available window is too small.
+	Default: 2
+
 tcp_tso_win_divisor - INTEGER
 	This allows control over what percentage of the congestion window
 	can be consumed by a single TSO frame.
diff --git a/include/net/sock.h b/include/net/sock.h
index e4bbcbf..6ba2e7b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -232,6 +232,7 @@ struct cg_proto;
   *	@sk_napi_id: id of the last napi context to receive data for sk
   *	@sk_ll_usec: usecs to busypoll when there is no data
   *	@sk_allocation: allocation mode
+  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
   *	@sk_sndbuf: size of send buffer in bytes
   *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
   *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -361,6 +362,7 @@ struct sock {
 	kmemcheck_bitfield_end(flags);
 	int			sk_wmem_queued;
 	gfp_t			sk_allocation;
+	u32			sk_pacing_rate; /* bytes per second */
 	netdev_features_t	sk_route_caps;
 	netdev_features_t	sk_route_nocaps;
 	int			sk_gso_type;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 09cb5c1..73fcd7c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -281,6 +281,7 @@ extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
 extern unsigned int sysctl_tcp_notsent_lowat;
+extern int sysctl_tcp_min_tso_segs;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 8ed7c32..540279f 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -29,6 +29,7 @@
 static int zero;
 static int one = 1;
 static int four = 4;
+static int gso_max_segs = GSO_MAX_SEGS;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -761,6 +762,15 @@ static struct ctl_table ipv4_table[] = {
 		.extra2		= &four,
 	},
 	{
+		.procname	= "tcp_min_tso_segs",
+		.data		= &sysctl_tcp_min_tso_segs,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &gso_max_segs,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 4e42c03..fdf7409 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -283,6 +283,8 @@
 
 int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 
+int sysctl_tcp_min_tso_segs __read_mostly = 2;
+
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
@@ -785,12 +787,28 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 	xmit_size_goal = mss_now;
 
 	if (large_allowed && sk_can_gso(sk)) {
-		xmit_size_goal = ((sk->sk_gso_max_size - 1) -
-				  inet_csk(sk)->icsk_af_ops->net_header_len -
-				  inet_csk(sk)->icsk_ext_hdr_len -
-				  tp->tcp_header_len);
+		u32 gso_size, hlen;
+
+		/* Maybe we should/could use sk->sk_prot->max_header here ? */
+		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+		       inet_csk(sk)->icsk_ext_hdr_len +
+		       tp->tcp_header_len;
+
+		/* Goal is to send at least one packet per ms,
+		 * not one big TSO packet every 100 ms.
+		 * This preserves ACK clocking and is consistent
+		 * with tcp_tso_should_defer() heuristic.
+		 */
+		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
+		gso_size = max_t(u32, gso_size,
+				 sysctl_tcp_min_tso_segs * mss_now);
+
+		xmit_size_goal = min_t(u32, gso_size,
+				       sk->sk_gso_max_size - 1 - hlen);
 
-		/* TSQ : try to have two TSO segments in flight */
+		/* TSQ : try to have at least two segments in flight
+		 * (one in NIC TX ring, another in Qdisc)
+		 */
 		xmit_size_goal = min_t(u32, xmit_size_goal,
 				       sysctl_tcp_limit_output_bytes >> 1);
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ec492ea..436c7e8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -688,6 +688,34 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
 	}
 }
 
+/* Set the sk_pacing_rate to allow proper sizing of TSO packets.
+ * Note: TCP stack does not yet implement pacing.
+ * FQ packet scheduler can be used to implement cheap but effective
+ * TCP pacing, to smooth the burst on large writes when packets
+ * in flight is significantly lower than cwnd (or rwin)
+ */
+static void tcp_update_pacing_rate(struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u64 rate;
+
+	/* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
+	rate = (u64)tp->mss_cache * 2 * (HZ << 3);
+
+	rate *= max(tp->snd_cwnd, tp->packets_out);
+
+	/* Correction for small srtt : minimum srtt being 8 (1 jiffy << 3),
+	 * be conservative and assume srtt = 1 (125 us instead of 1.25 ms)
+	 * We probably need usec resolution in the future.
+	 * Note: This also takes care of possible srtt=0 case,
+	 * when tcp_rtt_estimator() was not yet called.
+	 */
+	if (tp->srtt > 8 + 2)
+		do_div(rate, tp->srtt);
+
+	sk->sk_pacing_rate = min_t(u64, rate, ~0U);
+}
+
 /* Calculate rto without backoff.  This is the second half of Van Jacobson's
  * routine referred to above.
  */
@@ -3278,7 +3306,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	u32 ack_seq = TCP_SKB_CB(skb)->seq;
 	u32 ack = TCP_SKB_CB(skb)->ack_seq;
 	bool is_dupack = false;
-	u32 prior_in_flight;
+	u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
 	u32 prior_fackets;
 	int prior_packets = tp->packets_out;
 	const int prior_unsacked = tp->packets_out - tp->sacked_out;
@@ -3383,6 +3411,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
 		tcp_schedule_loss_probe(sk);
+	if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
+		tcp_update_pacing_rate(sk);
 	return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 884efff..e63ae4c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 
 	/* If a full-sized TSO skb can be sent, do it. */
 	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   sk->sk_gso_max_segs * tp->mss_cache))
+			   tp->xmit_size_goal_segs * tp->mss_cache))
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
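
The effect of the one-line tcp_tso_should_defer() change above can be sketched
in userspace as follows. This is an illustration under assumptions, not kernel
code: full_sized_sendable() is a hypothetical helper, limit stands for the
number of bytes the send and congestion windows currently allow (computed
earlier in the real function, outside this hunk), and the sample values for
gso_max_size and gso_max_segs are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: the "full-sized TSO skb can be sent" test,
 * parameterized on the segment count used to build the threshold.
 */
static bool full_sized_sendable(uint32_t limit, uint32_t gso_max_size,
				uint32_t segs, uint32_t mss)
{
	uint32_t threshold = segs * mss;

	if (threshold > gso_max_size)
		threshold = gso_max_size;
	return limit >= threshold;
}
```

With 2920 bytes of window available and mss 1460, a threshold built from an
assumed gso_max_segs of 65535 is capped at gso_max_size (65536 assumed), so
the test fails and the skb may be deferred. Built from an xmit_size_goal_segs
of 2, the threshold is 2920 bytes and the skb is sent immediately.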


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-27 12:46   ` [PATCH v3 " Eric Dumazet
@ 2013-08-28  0:17     ` Yuchung Cheng
  2013-08-28  0:21     ` Neal Cardwell
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Yuchung Cheng @ 2013-08-28  0:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Neal Cardwell, Van Jacobson, Tom Herbert

On Tue, Aug 27, 2013 at 5:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> After hearing many people over past years complaining against TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.
>
> One part of the problem is that tcp_tso_should_defer() uses a heuristic
> relying on upcoming ACKs instead of a timer, but more generally, having
> big TSO packets makes little sense for low rates, as it tends to create
> micro bursts on the network, and general consensus is to reduce the
> buffering amount.
>
> This patch introduces a per socket sk_pacing_rate, that approximates
> the current sending rate, and allows us to size the TSO packets so
> that we try to send one packet every ms.
>
> This field could be set by other transports.
>
> Patch has no impact for high speed flows, where having large TSO packets
> makes sense to reach line rate.
>
> For other flows, this helps better packet scheduling and ACK clocking.
>
> This patch increases performance of TCP flows in lossy environments.
>
> A new sysctl (tcp_min_tso_segs) is added, to specify the
> minimal size of a TSO packet (default being 2).
>
> A follow-up patch will provide a new packet scheduler (FQ), using
> sk_pacing_rate as an input to perform optional per flow pacing.
>
> This explains why we chose to set sk_pacing_rate to twice the current
> rate, allowing 'slow start' ramp up.
>
> sk_pacing_rate = 2 * cwnd * mss / srtt
>
> v2: Neal Cardwell reported a suspect deferring of last two segments on
> initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
> into account tp->xmit_size_goal_segs
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Van Jacobson <vanj@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> v3: The change Yuchung suggested added a possibility of a divide by 0:
>     In some (retransmit) cases, srtt can be 0 because
>     tcp_rtt_estimator() has not yet been called.
>     Change the computation to remove this, and do not yet use usec
>     as the units, but HZ. [ It's interesting to see jiffies_to_usecs()
>     being an out of line function :( ]
>
> This version passed all our tests.
>
>  Documentation/networking/ip-sysctl.txt |    9 ++++++
>  include/net/sock.h                     |    2 +
>  include/net/tcp.h                      |    1
>  net/ipv4/sysctl_net_ipv4.c             |   10 +++++++
>  net/ipv4/tcp.c                         |   28 ++++++++++++++++----
>  net/ipv4/tcp_input.c                   |   32 ++++++++++++++++++++++-
>  net/ipv4/tcp_output.c                  |    2 -
>  7 files changed, 77 insertions(+), 7 deletions(-)
Acked-by: Yuchung Cheng <ycheng@google.com>

>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index debfe85..ce5bb43 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -482,6 +482,15 @@ tcp_syn_retries - INTEGER
>  tcp_timestamps - BOOLEAN
>         Enable timestamps as defined in RFC1323.
>
> +tcp_min_tso_segs - INTEGER
> +       Minimal number of segments per TSO frame.
> +       Since linux-3.12, TCP does an automatic sizing of TSO frames,
> +       depending on flow rate, instead of filling 64Kbytes packets.
> +       For specific usages, it's possible to force TCP to build big
> +       TSO frames. Note that TCP stack might split too big TSO packets
> +       if available window is too small.
> +       Default: 2
> +
>  tcp_tso_win_divisor - INTEGER
>         This allows control over what percentage of the congestion window
>         can be consumed by a single TSO frame.
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e4bbcbf..6ba2e7b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -232,6 +232,7 @@ struct cg_proto;
>    *    @sk_napi_id: id of the last napi context to receive data for sk
>    *    @sk_ll_usec: usecs to busypoll when there is no data
>    *    @sk_allocation: allocation mode
> +  *    @sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
>    *    @sk_sndbuf: size of send buffer in bytes
>    *    @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>    *               %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
> @@ -361,6 +362,7 @@ struct sock {
>         kmemcheck_bitfield_end(flags);
>         int                     sk_wmem_queued;
>         gfp_t                   sk_allocation;
> +       u32                     sk_pacing_rate; /* bytes per second */
>         netdev_features_t       sk_route_caps;
>         netdev_features_t       sk_route_nocaps;
>         int                     sk_gso_type;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 09cb5c1..73fcd7c 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -281,6 +281,7 @@ extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
>  extern unsigned int sysctl_tcp_notsent_lowat;
> +extern int sysctl_tcp_min_tso_segs;
>
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 8ed7c32..540279f 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -29,6 +29,7 @@
>  static int zero;
>  static int one = 1;
>  static int four = 4;
> +static int gso_max_segs = GSO_MAX_SEGS;
>  static int tcp_retr1_max = 255;
>  static int ip_local_port_range_min[] = { 1, 1 };
>  static int ip_local_port_range_max[] = { 65535, 65535 };
> @@ -761,6 +762,15 @@ static struct ctl_table ipv4_table[] = {
>                 .extra2         = &four,
>         },
>         {
> +               .procname       = "tcp_min_tso_segs",
> +               .data           = &sysctl_tcp_min_tso_segs,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +               .extra2         = &gso_max_segs,
> +       },
> +       {
>                 .procname       = "udp_mem",
>                 .data           = &sysctl_udp_mem,
>                 .maxlen         = sizeof(sysctl_udp_mem),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4e42c03..fdf7409 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -283,6 +283,8 @@
>
>  int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>
> +int sysctl_tcp_min_tso_segs __read_mostly = 2;
> +
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>
> @@ -785,12 +787,28 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>         xmit_size_goal = mss_now;
>
>         if (large_allowed && sk_can_gso(sk)) {
> -               xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> -                                 inet_csk(sk)->icsk_af_ops->net_header_len -
> -                                 inet_csk(sk)->icsk_ext_hdr_len -
> -                                 tp->tcp_header_len);
> +               u32 gso_size, hlen;
> +
> +               /* Maybe we should/could use sk->sk_prot->max_header here ? */
> +               hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
> +                      inet_csk(sk)->icsk_ext_hdr_len +
> +                      tp->tcp_header_len;
> +
> +               /* Goal is to send at least one packet per ms,
> +                * not one big TSO packet every 100 ms.
> +                * This preserves ACK clocking and is consistent
> +                * with tcp_tso_should_defer() heuristic.
> +                */
> +               gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
> +               gso_size = max_t(u32, gso_size,
> +                                sysctl_tcp_min_tso_segs * mss_now);
> +
> +               xmit_size_goal = min_t(u32, gso_size,
> +                                      sk->sk_gso_max_size - 1 - hlen);
>
> -               /* TSQ : try to have two TSO segments in flight */
> +               /* TSQ : try to have at least two segments in flight
> +                * (one in NIC TX ring, another in Qdisc)
> +                */
>                 xmit_size_goal = min_t(u32, xmit_size_goal,
>                                        sysctl_tcp_limit_output_bytes >> 1);
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ec492ea..436c7e8 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -688,6 +688,34 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
>         }
>  }
>
> +/* Set the sk_pacing_rate to allow proper sizing of TSO packets.
> + * Note: TCP stack does not yet implement pacing.
> + * FQ packet scheduler can be used to implement cheap but effective
> + * TCP pacing, to smooth the burst on large writes when packets
> + * in flight is significantly lower than cwnd (or rwin)
> + */
> +static void tcp_update_pacing_rate(struct sock *sk)
> +{
> +       const struct tcp_sock *tp = tcp_sk(sk);
> +       u64 rate;
> +
> +       /* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
> +       rate = (u64)tp->mss_cache * 2 * (HZ << 3);
> +
> +       rate *= max(tp->snd_cwnd, tp->packets_out);
> +
> +       /* Correction for small srtt : minimum srtt being 8 (1 jiffy << 3),
> +        * be conservative and assume srtt = 1 (125 us instead of 1.25 ms)
> +        * We probably need usec resolution in the future.
> +        * Note: This also takes care of possible srtt=0 case,
> +        * when tcp_rtt_estimator() was not yet called.
> +        */
> +       if (tp->srtt > 8 + 2)
> +               do_div(rate, tp->srtt);
> +
> +       sk->sk_pacing_rate = min_t(u64, rate, ~0U);
> +}
> +
>  /* Calculate rto without backoff.  This is the second half of Van Jacobson's
>   * routine referred to above.
>   */
> @@ -3278,7 +3306,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>         u32 ack_seq = TCP_SKB_CB(skb)->seq;
>         u32 ack = TCP_SKB_CB(skb)->ack_seq;
>         bool is_dupack = false;
> -       u32 prior_in_flight;
> +       u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
>         u32 prior_fackets;
>         int prior_packets = tp->packets_out;
>         const int prior_unsacked = tp->packets_out - tp->sacked_out;
> @@ -3383,6 +3411,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>
>         if (icsk->icsk_pending == ICSK_TIME_RETRANS)
>                 tcp_schedule_loss_probe(sk);
> +       if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
> +               tcp_update_pacing_rate(sk);
>         return 1;
>
>  no_queue:
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 884efff..e63ae4c 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>
>         /* If a full-sized TSO skb can be sent, do it. */
>         if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> -                          sk->sk_gso_max_segs * tp->mss_cache))
> +                          tp->xmit_size_goal_segs * tp->mss_cache))
>                 goto send_now;
>
>         /* Middle in queue won't get any more data, full sendable already? */
>
>


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-27 12:46   ` [PATCH v3 " Eric Dumazet
  2013-08-28  0:17     ` Yuchung Cheng
@ 2013-08-28  0:21     ` Neal Cardwell
  2013-08-28  7:37     ` Jason Wang
  2013-08-29 19:51     ` David Miller
  3 siblings, 0 replies; 25+ messages in thread
From: Neal Cardwell @ 2013-08-28  0:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Van Jacobson, Tom Herbert

On Tue, Aug 27, 2013 at 8:46 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> After hearing many people over past years complaining against TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.
>
> One part of the problem is that tcp_tso_should_defer() uses a heuristic
> relying on upcoming ACKs instead of a timer, but more generally, having
> big TSO packets makes little sense for low rates, as it tends to create
> micro bursts on the network, and general consensus is to reduce the
> buffering amount.
>
> This patch introduces a per socket sk_pacing_rate, that approximates
> the current sending rate, and allows us to size the TSO packets so
> that we try to send one packet every ms.
>
> This field could be set by other transports.
>
> Patch has no impact for high speed flows, where having large TSO packets
> makes sense to reach line rate.
>
> For other flows, this helps better packet scheduling and ACK clocking.
>
> This patch increases performance of TCP flows in lossy environments.
>
> A new sysctl (tcp_min_tso_segs) is added, to specify the
> minimal size of a TSO packet (default being 2).
>
> A follow-up patch will provide a new packet scheduler (FQ), using
> sk_pacing_rate as an input to perform optional per flow pacing.
>
> This explains why we chose to set sk_pacing_rate to twice the current
> rate, allowing 'slow start' ramp up.
>
> sk_pacing_rate = 2 * cwnd * mss / srtt
>
> v2: Neal Cardwell reported a suspect deferring of last two segments on
> initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
> into account tp->xmit_size_goal_segs
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Van Jacobson <vanj@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> v3: The change Yuchung suggested added a possibility of a divide by 0:
>     In some (retransmit) cases, srtt can be 0 because
>     tcp_rtt_estimator() has not yet been called.
>     Change the computation to remove this, and do not yet use usec
>     as the units, but HZ. [ It's interesting to see jiffies_to_usecs()
>     being an out of line function :( ]
>
> This version passed all our tests.
>
>  Documentation/networking/ip-sysctl.txt |    9 ++++++
>  include/net/sock.h                     |    2 +
>  include/net/tcp.h                      |    1
>  net/ipv4/sysctl_net_ipv4.c             |   10 +++++++
>  net/ipv4/tcp.c                         |   28 ++++++++++++++++----
>  net/ipv4/tcp_input.c                   |   32 ++++++++++++++++++++++-
>  net/ipv4/tcp_output.c                  |    2 -
>  7 files changed, 77 insertions(+), 7 deletions(-)

Acked-by: Neal Cardwell <ncardwell@google.com>

neal


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-27 12:46   ` [PATCH v3 " Eric Dumazet
  2013-08-28  0:17     ` Yuchung Cheng
  2013-08-28  0:21     ` Neal Cardwell
@ 2013-08-28  7:37     ` Jason Wang
  2013-08-28 10:34       ` Eric Dumazet
  2013-08-29 19:51     ` David Miller
  3 siblings, 1 reply; 25+ messages in thread
From: Jason Wang @ 2013-08-28  7:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng, Van Jacobson,
	Tom Herbert, Michael S. Tsirkin

On 08/27/2013 08:46 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> After hearing many people over past years complaining against TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.
>
> One part of the problem is that tcp_tso_should_defer() uses a heuristic
> relying on upcoming ACKs instead of a timer, but more generally, having
> big TSO packets makes little sense for low rates, as it tends to create
> micro bursts on the network, and general consensus is to reduce the
> buffering amount.
>
> This patch introduces a per socket sk_pacing_rate, that approximates
> the current sending rate, and allows us to size the TSO packets so
> that we try to send one packet every ms.
>
> This field could be set by other transports.
>
> Patch has no impact for high speed flows, where having large TSO packets
> makes sense to reach line rate.
>
> For other flows, this helps better packet scheduling and ACK clocking.
>
> This patch increases performance of TCP flows in lossy environments.
>
> A new sysctl (tcp_min_tso_segs) is added, to specify the
> minimal size of a TSO packet (default being 2).
>
> A follow-up patch will provide a new packet scheduler (FQ), using
> sk_pacing_rate as an input to perform optional per flow pacing.
>
> This explains why we chose to set sk_pacing_rate to twice the current
> rate, allowing 'slow start' ramp up.
>
> sk_pacing_rate = 2 * cwnd * mss / srtt
>  
> v2: Neal Cardwell reported a suspect deferring of last two segments on
> initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
> into account tp->xmit_size_goal_segs 
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Van Jacobson <vanj@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> v3: The change Yuchung suggested added a possibility of a divide by 0:
>     On some (retransmits) case, srtt can be 0 because
>     tcp_rtt_estimator() has not yet been called.
>     Change the computation to remove this, and do not yet use usec
>     as the units, but HZ. [ Its interesting to see jiffies_to_usecs()
>     being an out of line function :( ]
>
> This version passed all our tests.
>
>  Documentation/networking/ip-sysctl.txt |    9 ++++++
>  include/net/sock.h                     |    2 +
>  include/net/tcp.h                      |    1 
>  net/ipv4/sysctl_net_ipv4.c             |   10 +++++++
>  net/ipv4/tcp.c                         |   28 ++++++++++++++++----
>  net/ipv4/tcp_input.c                   |   32 ++++++++++++++++++++++-
>  net/ipv4/tcp_output.c                  |    2 -
>  7 files changed, 77 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index debfe85..ce5bb43 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -482,6 +482,15 @@ tcp_syn_retries - INTEGER
>  tcp_timestamps - BOOLEAN
>  	Enable timestamps as defined in RFC1323.
>  
> +tcp_min_tso_segs - INTEGER
> +	Minimal number of segments per TSO frame.
> +	Since linux-3.12, TCP does an automatic sizing of TSO frames,
> +	depending on flow rate, instead of filling 64Kbytes packets.
> +	For specific usages, it's possible to force TCP to build big
> +	TSO frames. Note that TCP stack might split too big TSO packets
> +	if available window is too small.
> +	Default: 2
> +
>  tcp_tso_win_divisor - INTEGER
>  	This allows control over what percentage of the congestion window
>  	can be consumed by a single TSO frame.
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e4bbcbf..6ba2e7b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -232,6 +232,7 @@ struct cg_proto;
>    *	@sk_napi_id: id of the last napi context to receive data for sk
>    *	@sk_ll_usec: usecs to busypoll when there is no data
>    *	@sk_allocation: allocation mode
> +  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
>    *	@sk_sndbuf: size of send buffer in bytes
>    *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>    *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
> @@ -361,6 +362,7 @@ struct sock {
>  	kmemcheck_bitfield_end(flags);
>  	int			sk_wmem_queued;
>  	gfp_t			sk_allocation;
> +	u32			sk_pacing_rate; /* bytes per second */
>  	netdev_features_t	sk_route_caps;
>  	netdev_features_t	sk_route_nocaps;
>  	int			sk_gso_type;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 09cb5c1..73fcd7c 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -281,6 +281,7 @@ extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
>  extern unsigned int sysctl_tcp_notsent_lowat;
> +extern int sysctl_tcp_min_tso_segs;
>  
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 8ed7c32..540279f 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -29,6 +29,7 @@
>  static int zero;
>  static int one = 1;
>  static int four = 4;
> +static int gso_max_segs = GSO_MAX_SEGS;
>  static int tcp_retr1_max = 255;
>  static int ip_local_port_range_min[] = { 1, 1 };
>  static int ip_local_port_range_max[] = { 65535, 65535 };
> @@ -761,6 +762,15 @@ static struct ctl_table ipv4_table[] = {
>  		.extra2		= &four,
>  	},
>  	{
> +		.procname	= "tcp_min_tso_segs",
> +		.data		= &sysctl_tcp_min_tso_segs,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &gso_max_segs,
> +	},
> +	{
>  		.procname	= "udp_mem",
>  		.data		= &sysctl_udp_mem,
>  		.maxlen		= sizeof(sysctl_udp_mem),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4e42c03..fdf7409 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -283,6 +283,8 @@
>  
>  int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>  
> +int sysctl_tcp_min_tso_segs __read_mostly = 2;
> +
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>  
> @@ -785,12 +787,28 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>  	xmit_size_goal = mss_now;
>  
>  	if (large_allowed && sk_can_gso(sk)) {
> -		xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> -				  inet_csk(sk)->icsk_af_ops->net_header_len -
> -				  inet_csk(sk)->icsk_ext_hdr_len -
> -				  tp->tcp_header_len);
> +		u32 gso_size, hlen;
> +
> +		/* Maybe we should/could use sk->sk_prot->max_header here ? */
> +		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
> +		       inet_csk(sk)->icsk_ext_hdr_len +
> +		       tp->tcp_header_len;
> +
> +		/* Goal is to send at least one packet per ms,
> +		 * not one big TSO packet every 100 ms.
> +		 * This preserves ACK clocking and is consistent
> +		 * with tcp_tso_should_defer() heuristic.
> +		 */
> +		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
> +		gso_size = max_t(u32, gso_size,
> +				 sysctl_tcp_min_tso_segs * mss_now);
> +
> +		xmit_size_goal = min_t(u32, gso_size,
> +				       sk->sk_gso_max_size - 1 - hlen);
>  
> -		/* TSQ : try to have two TSO segments in flight */
> +		/* TSQ : try to have at least two segments in flight
> +		 * (one in NIC TX ring, another in Qdisc)
> +		 */
>  		xmit_size_goal = min_t(u32, xmit_size_goal,
>  				       sysctl_tcp_limit_output_bytes >> 1);
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ec492ea..436c7e8 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -688,6 +688,34 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
>  	}
>  }
>  
> +/* Set the sk_pacing_rate to allow proper sizing of TSO packets.
> + * Note: TCP stack does not yet implement pacing.
> + * FQ packet scheduler can be used to implement cheap but effective
> + * TCP pacing, to smooth the burst on large writes when packets
> + * in flight is significantly lower than cwnd (or rwin)
> + */
> +static void tcp_update_pacing_rate(struct sock *sk)
> +{
> +	const struct tcp_sock *tp = tcp_sk(sk);
> +	u64 rate;
> +
> +	/* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
> +	rate = (u64)tp->mss_cache * 2 * (HZ << 3);
> +
> +	rate *= max(tp->snd_cwnd, tp->packets_out);
> +
> +	/* Correction for small srtt : minimum srtt being 8 (1 jiffy << 3),
> +	 * be conservative and assume srtt = 1 (125 us instead of 1.25 ms)
> +	 * We probably need usec resolution in the future.
> +	 * Note: This also takes care of possible srtt=0 case,
> +	 * when tcp_rtt_estimator() was not yet called.
> +	 */
> +	if (tp->srtt > 8 + 2)
> +		do_div(rate, tp->srtt);
> +
> +	sk->sk_pacing_rate = min_t(u64, rate, ~0U);
> +}
> +
>  /* Calculate rto without backoff.  This is the second half of Van Jacobson's
>   * routine referred to above.
>   */
> @@ -3278,7 +3306,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  	u32 ack_seq = TCP_SKB_CB(skb)->seq;
>  	u32 ack = TCP_SKB_CB(skb)->ack_seq;
>  	bool is_dupack = false;
> -	u32 prior_in_flight;
> +	u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
>  	u32 prior_fackets;
>  	int prior_packets = tp->packets_out;
>  	const int prior_unsacked = tp->packets_out - tp->sacked_out;
> @@ -3383,6 +3411,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  
>  	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
>  		tcp_schedule_loss_probe(sk);
> +	if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
> +		tcp_update_pacing_rate(sk);
>  	return 1;
>  
>  no_queue:
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 884efff..e63ae4c 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>  
>  	/* If a full-sized TSO skb can be sent, do it. */
>  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> -			   sk->sk_gso_max_segs * tp->mss_cache))
> +			   tp->xmit_size_goal_segs * tp->mss_cache))
>  		goto send_now;
One question: does this really guarantee the minimum number of TSO
segments, excluding the case of a small available window? skb->len may be
much smaller and the skb can still be sent here. Maybe we should also
check skb->len?

>  
>  	/* Middle in queue won't get any more data, full sendable already? */
>
>


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-28  7:37     ` Jason Wang
@ 2013-08-28 10:34       ` Eric Dumazet
  2013-08-30  3:02         ` Jason Wang
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-08-28 10:34 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng, Van Jacobson,
	Tom Herbert, Michael S. Tsirkin

On Wed, 2013-08-28 at 15:37 +0800, Jason Wang wrote:

> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 884efff..e63ae4c 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
> >  
> >  	/* If a full-sized TSO skb can be sent, do it. */
> >  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> > -			   sk->sk_gso_max_segs * tp->mss_cache))
> > +			   tp->xmit_size_goal_segs * tp->mss_cache))
> >  		goto send_now;
> A question is: Does this really guarantee the minimal TSO segments
> excluding the case of small available window? The skb->len may be much
> smaller and can still be sent here. Maybe we should check skb->len also?

tcp_tso_should_defer() is all about hoping the application will
'complete' the last skb in write queue with more payload in the near
future.

skb->len might therefore change because sendmsg()/sendpage() will add
new stuff in the skb.

We are trying hard not to remove tcp_tso_should_defer() and to make the
best of it. We have not yet decided to add a real timer instead of relying
on upcoming ACKs.

Neal has an idea/patch to avoid a defer depending on
the expected time of following ACKS.

By making the TSO sizes smaller for low rates, we avoid these stalls
from tcp_tso_should_defer(), because an incoming ACK has normally freed
enough window to send the next packet in write queue without the need to
split it into two parts.

These changes are fundamental to use delay based congestion modules like
Vegas/Westwood and experimental new ones, without having to disable TSO.
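
To make the sizing rule concrete, here is a minimal stand-alone sketch of
the clamp that the patch adds to tcp_xmit_size_goal(). It is not the kernel
code: the function name tso_size_goal() and the gso_max_payload parameter
(standing in for sk_gso_max_size - 1 - hlen) are hypothetical, and the TSQ
limit and the rounding to whole segments are left out.

```c
#include <stdint.h>

#define MSEC_PER_SEC 1000u

/* Hypothetical helper mirroring the sizing step in the patch.
 * pacing_rate:     bytes per second (sk_pacing_rate, i.e. 2x the estimated rate)
 * mss:             current MSS in bytes
 * min_tso_segs:    value of the tcp_min_tso_segs sysctl
 * gso_max_payload: assumed stand-in for sk_gso_max_size - 1 - hlen
 */
uint32_t tso_size_goal(uint32_t pacing_rate, uint32_t mss,
		       uint32_t min_tso_segs, uint32_t gso_max_payload)
{
	/* pacing_rate is twice the current rate, so dividing by
	 * 2 * MSEC_PER_SEC gives the bytes sent in one ms at the current rate.
	 */
	uint32_t gso_size = pacing_rate / (2 * MSEC_PER_SEC);
	uint32_t floor = min_tso_segs * mss;

	if (gso_size < floor)		/* never below the sysctl minimum */
		gso_size = floor;
	if (gso_size > gso_max_payload)	/* never above what the device accepts */
		gso_size = gso_max_payload;
	return gso_size;
}
```

With a pacing rate of 567843 bytes/sec and an MSS of 1448, one ms of data is
only 283 bytes, so the goal falls back to the two-segment minimum of 2896
bytes. A flow fast enough to fill more than gso_max_payload per ms keeps
building full-sized TSO packets, which is why high speed flows are not
affected.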


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-28 10:34       ` Eric Dumazet
@ 2013-08-30  3:02         ` Jason Wang
  0 siblings, 0 replies; 25+ messages in thread
From: Jason Wang @ 2013-08-30  3:02 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Neal Cardwell, Yuchung Cheng, Van Jacobson,
	Tom Herbert, Michael S. Tsirkin

On 08/28/2013 06:34 PM, Eric Dumazet wrote:
> On Wed, 2013-08-28 at 15:37 +0800, Jason Wang wrote:
>
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index 884efff..e63ae4c 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>>>  
>>>  	/* If a full-sized TSO skb can be sent, do it. */
>>>  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
>>> -			   sk->sk_gso_max_segs * tp->mss_cache))
>>> +			   tp->xmit_size_goal_segs * tp->mss_cache))
>>>  		goto send_now;
>> A question is: Does this really guarantee the minimal TSO segments
>> excluding the case of small available window? The skb->len may be much
>> smaller and can still be sent here. Maybe we should check skb->len also?
> tcp_tso_should_defer() is all about hoping the application will
> 'complete' the last skb in write queue with more payload in the near
> future.
>
> skb->len might therefore change because sendmsg()/sendpage() will add
> new stuff in the skb.

True, but sometimes the application may be slow to fill bytes into the
skb, especially when it runs in a virt guest with multiqueue. In that
case, the application in the guest tends to be slower than the NIC
(virtio-net), which does the transmission through a host thread (vhost).
It looks like the current defer algorithm does not handle this very well,
and if we want to force batching into 64K packets, tcp_min_tso_segs does
not work well either.
> We try hard to not remove tcp_tso_should_defer() and take the best of
> it. We have not yet decided to add a real timer instead of relying on
> upcoming ACKS.
>
> Neal has an idea/patch to avoid a defer depending on
> the expected time of following ACKS.
>
> By making the TSO sizes smaller for low rates, we avoid these stalls
> from tcp_tso_should_defer(), because an incoming ACK has normally freed
> enough window to send the next packet in write queue without the need to
> split it into two parts.
>
> These changes are fundamental to use delay based congestion modules like
> Vegas/Westwood and experimental new ones, without having to disable TSO.
>
>
>


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-27 12:46   ` [PATCH v3 " Eric Dumazet
                       ` (2 preceding siblings ...)
  2013-08-28  7:37     ` Jason Wang
@ 2013-08-29 19:51     ` David Miller
  2013-08-29 20:26       ` Eric Dumazet
  3 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2013-08-29 19:51 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng, vanj, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 27 Aug 2013 05:46:32 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> After hearing many people over past years complaining against TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.
> 
> One part of the problem is that tcp_tso_should_defer() uses an heuristic
> relying on upcoming ACKS instead of a timer, but more generally, having
> big TSO packets makes little sense for low rates, as it tends to create
> micro bursts on the network, and general consensus is to reduce the
> buffering amount.
> 
> This patch introduces a per socket sk_pacing_rate, that approximates
> the current sending rate, and allows us to size the TSO packets so
> that we try to send one packet every ms.
> 
> This field could be set by other transports.
> 
> Patch has no impact for high speed flows, where having large TSO packets
> makes sense to reach line rate.
> 
> For other flows, this helps better packet scheduling and ACK clocking.
> 
> This patch increases performance of TCP flows in lossy environments.
> 
> A new sysctl (tcp_min_tso_segs) is added, to specify the
> minimal size of a TSO packet (default being 2).
> 
> A follow-up patch will provide a new packet scheduler (FQ), using
> sk_pacing_rate as an input to perform optional per flow pacing.
> 
> This explains why we chose to set sk_pacing_rate to twice the current
> rate, allowing 'slow start' ramp up.
> 
> sk_pacing_rate = 2 * cwnd * mss / srtt
>  
> v2: Neal Cardwell reported a suspect deferring of last two segments on
> initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
> into account tp->xmit_size_goal_segs 
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, please post a new copy of your accompanying packet scheduler.

Thanks.


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-29 19:51     ` David Miller
@ 2013-08-29 20:26       ` Eric Dumazet
  2013-08-29 20:35         ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-08-29 20:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, ncardwell, ycheng, vanj, therbert

On Thu, 2013-08-29 at 15:51 -0400, David Miller wrote:

> Applied, please post a new copy of your accompanying packet scheduler.
> 
> Thanks.

Thanks David.

I am a bit puzzled by the caching of srtt in tcp metrics. We tend to
cache bufferbloated values that are almost useless.

On this 50ms RTT link, the syn/synack rtt was correctly sampled at 51
jiffies, but tcp_init_metrics() finds a very high srtt cached from a
previous tcp flow, which ended its life with a huge cwin=327/srtt=1468
because of bufferbloat.

Since the new connection starts with IW10, the estimated rate is slightly
wrong for the first ~10 incoming acks, before the ewma converges to the
right value...

[ 4544.656476] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.656482] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 0
[ 4544.656496] TCP: sk ffff88085825d180 cwnd 10 packets 0 rate 231680000/srtt 408
[ 4544.707045] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 50 sack_rtt 4294967295
[ 4544.707051] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 50 srtt 1468
[ 4544.707055] TCP: sk ffff88085825d180 cwnd 11 packets 9 rate 254848000/srtt 1335
[ 4544.707067] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 50 sack_rtt 4294967295
[ 4544.707069] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 50 srtt 1335
[ 4544.707071] TCP: sk ffff88085825d180 cwnd 12 packets 10 rate 278016000/srtt 1219
[ 4544.707694] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.707699] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 1219
[ 4544.707703] TCP: sk ffff88085825d180 cwnd 13 packets 11 rate 301184000/srtt 1118
[ 4544.708324] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 52 sack_rtt 4294967295
[ 4544.708330] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 52 srtt 1118
[ 4544.708333] TCP: sk ffff88085825d180 cwnd 14 packets 12 rate 324352000/srtt 1031
[ 4544.708846] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 52 sack_rtt 4294967295
[ 4544.708851] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 52 srtt 1031
[ 4544.708855] TCP: sk ffff88085825d180 cwnd 15 packets 13 rate 347520000/srtt 955
[ 4544.709521] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 53 sack_rtt 4294967295
[ 4544.709526] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 53 srtt 955
[ 4544.709530] TCP: sk ffff88085825d180 cwnd 16 packets 14 rate 370688000/srtt 889
[ 4544.710103] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 53 sack_rtt 4294967295
[ 4544.710108] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 53 srtt 889
[ 4544.710111] TCP: sk ffff88085825d180 cwnd 17 packets 15 rate 393856000/srtt 831
[ 4544.710683] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 54 sack_rtt 4294967295
[ 4544.710688] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 54 srtt 831
[ 4544.710691] TCP: sk ffff88085825d180 cwnd 18 packets 16 rate 417024000/srtt 782
[ 4544.711210] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 55 sack_rtt 4294967295
[ 4544.711215] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 55 srtt 782
[ 4544.711219] TCP: sk ffff88085825d180 cwnd 19 packets 17 rate 440192000/srtt 740
[ 4544.711868] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 55 sack_rtt 4294967295
[ 4544.711873] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 55 srtt 740
[ 4544.711876] TCP: sk ffff88085825d180 cwnd 20 packets 18 rate 463360000/srtt 703
[ 4544.757576] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.757581] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 703
[ 4544.757585] TCP: sk ffff88085825d180 cwnd 21 packets 19 rate 486528000/srtt 667
[ 4544.757595] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 51 sack_rtt 4294967295
[ 4544.757597] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 51 srtt 667
[ 4544.757610] TCP: sk ffff88085825d180 cwnd 22 packets 20 rate 509696000/srtt 635
[ 4544.773527] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 67 sack_rtt 4294967295
[ 4544.773533] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 67 srtt 635
[ 4544.773536] TCP: sk ffff88085825d180 cwnd 23 packets 21 rate 532864000/srtt 623
[ 4544.773548] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 67 sack_rtt 4294967295
[ 4544.773560] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 67 srtt 623
[ 4544.773562] TCP: sk ffff88085825d180 cwnd 24 packets 22 rate 556032000/srtt 613
[ 4544.778208] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 71 sack_rtt 4294967295
[ 4544.778213] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 71 srtt 613
[ 4544.778216] TCP: sk ffff88085825d180 cwnd 25 packets 23 rate 579200000/srtt 608
[ 4544.778237] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 71 sack_rtt 4294967295
[ 4544.778238] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 71 srtt 608
[ 4544.778240] TCP: sk ffff88085825d180 cwnd 26 packets 24 rate 602368000/srtt 603
[ 4544.782776] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 74 sack_rtt 4294967295
[ 4544.782781] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 74 srtt 603
[ 4544.782785] TCP: sk ffff88085825d180 cwnd 27 packets 25 rate 625536000/srtt 602
[ 4544.782795] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 74 sack_rtt 4294967295
[ 4544.782808] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 74 srtt 602
...
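
The srtt column in the trace above can be reproduced with the usual 1/8-gain
EWMA on a value stored as jiffies << 3. The following is a minimal sketch,
assuming that update rule; srtt_update() is a hypothetical stand-alone
helper, not the kernel's tcp_rtt_estimator(), and it ignores the mdev/rttvar
part entirely.

```c
#include <stdint.h>

/* srtt is stored scaled by 8 (jiffies << 3); mrtt is one sample in jiffies.
 * Hypothetical helper: returns the new scaled srtt.
 */
uint32_t srtt_update(uint32_t srtt, uint32_t mrtt)
{
	int32_t m = (int32_t)(mrtt ? mrtt : 1);	/* a 0 sample is treated as 1 */

	if (srtt == 0)
		return (uint32_t)m << 3;	/* first sample seeds the estimate */

	m -= (int32_t)(srtt >> 3);		/* error against current estimate */
	return (uint32_t)((int32_t)srtt + m);	/* srtt = 7/8 srtt + 1/8 sample */
}
```

Starting from srtt 0, the first sample of 51 gives 408, as in the first
rate line of the trace. Starting from the cached value 1468, a sample of 50
gives 1335, then 1219, and so on: each ACK removes only one eighth of the
error, which is why the estimate needs many ACKs to come down.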
Typical bufferbloat at the end of transfer:

[ 4547.051521] TCP: sk ffff88085825d180 cwnd 327 packets 3 rate 7575936000/srtt 1581
[ 4547.052722] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
[ 4547.052726] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1581
[ 4547.052729] TCP: sk ffff88085825d180 cwnd 327 packets 1 rate 7575936000/srtt 1582
[ 4547.053315] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
[ 4547.053318] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1582
[ 4547.053321] TCP: sk ffff88085825d180 cwnd 327 packets 0 rate 7575936000/srtt 1583
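
The "rate X/srtt Y" fields in these debug lines appear to print the
numerator and the divisor of the sk_pacing_rate computation separately.
A minimal sketch of that arithmetic follows, assuming HZ=1000 and
mss=1448; both values are inferred from the trace (231680000 =
1448 * 2 * 8000 * 10), not stated anywhere in it. The function names are
hypothetical.

```c
#include <stdint.h>

#define HZ 1000u	/* assumption: inferred from the trace */

/* Numerator of the rate: 2 * mss * max(cwnd, packets_out), scaled by HZ << 3
 * because srtt is kept as jiffies << 3.
 */
uint64_t pacing_rate_numerator(uint32_t mss, uint32_t cwnd,
			       uint32_t packets_out)
{
	uint64_t rate = (uint64_t)mss * 2 * (HZ << 3);

	rate *= (cwnd > packets_out) ? cwnd : packets_out;
	return rate;
}

/* Resulting pacing rate in bytes per second, saturated to 32 bits. */
uint32_t pacing_rate(uint32_t mss, uint32_t cwnd, uint32_t packets_out,
		     uint32_t srtt)
{
	uint64_t rate = pacing_rate_numerator(mss, cwnd, packets_out);

	/* Skip the division for very small (or still zero) srtt, as in v3. */
	if (srtt > 8 + 2)
		rate /= srtt;

	return rate > UINT32_MAX ? UINT32_MAX : (uint32_t)rate;
}
```

Under these assumptions, cwnd 10 with srtt 408 gives 567843 bytes/sec, and
the final line of the trace (cwnd 327, srtt 1583) gives 4785809 bytes/sec.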

Maybe we could instead store a value corrected by the sk_pacing_rate

rate = (big_cwin * mss) / big_srtt

stored_rtt = rate / (big_cwin * mss)


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-29 20:26       ` Eric Dumazet
@ 2013-08-29 20:35         ` David Miller
  2013-08-29 21:26           ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2013-08-29 20:35 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, ncardwell, ycheng, vanj, therbert

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 29 Aug 2013 13:26:17 -0700

> Typical bufferbloat at the end of transfer:
> 
> [ 4547.051521] TCP: sk ffff88085825d180 cwnd 327 packets 3 rate 7575936000/srtt 1581
> [ 4547.052722] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
> [ 4547.052726] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1581
> [ 4547.052729] TCP: sk ffff88085825d180 cwnd 327 packets 1 rate 7575936000/srtt 1582
> [ 4547.053315] TCP: tcp_ack_update_rtt sk ffff88085825d180 seq_rtt 198 sack_rtt 4294967295
> [ 4547.053318] TCP: tcp_rtt_estimator sk ffff88085825d180 mrtt 198 srtt 1582
> [ 4547.053321] TCP: sk ffff88085825d180 cwnd 327 packets 0 rate 7575936000/srtt 1583
> 
> Maybe we could instead store a value corrected by the sk_pacing_rate
> 
> rate = (big_cwin * mss) / big_srtt
> 
> stored_rtt = rate / (big_cwin * mss)

No objections from me.


* Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
  2013-08-29 20:35         ` David Miller
@ 2013-08-29 21:26           ` Eric Dumazet
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-08-29 21:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, ncardwell, ycheng, vanj, therbert

On Thu, 2013-08-29 at 16:35 -0400, David Miller wrote:

> 
> No objections from me.

We'll cook a different patch.

The idea is to feed tcp_set_rto() with the srtt found in the tcp metric
cache, and leave the tp->srtt value sampled from the SYN/SYNACK (if
available) as is.

(Be conservative for the initial rto value, yet allow tp->srtt to be the
current rtt on the network.)

Thanks


end of thread, other threads:[~2013-08-30  3:02 UTC | newest]

Thread overview: 25+ messages
2013-08-24  0:29 [PATCH net-next] tcp: TSO packets automatic sizing Eric Dumazet
2013-08-24  3:17 ` Neal Cardwell
2013-08-24 18:56   ` Eric Dumazet
2013-08-24 20:28     ` Eric Dumazet
2013-08-25 22:01     ` Yuchung Cheng
2013-08-26  0:37       ` Eric Dumazet
2013-08-26  2:22         ` Eric Dumazet
2013-08-26  3:58           ` Eric Dumazet
2013-08-25  2:46 ` David Miller
2013-08-25  2:52   ` Eric Dumazet
2013-08-26  4:26 ` [PATCH v2 " Eric Dumazet
2013-08-26 19:09   ` Yuchung Cheng
2013-08-26 20:28     ` Eric Dumazet
2013-08-26 22:31       ` Yuchung Cheng
2013-08-27  0:47   ` Eric Dumazet
2013-08-27 12:46   ` [PATCH v3 " Eric Dumazet
2013-08-28  0:17     ` Yuchung Cheng
2013-08-28  0:21     ` Neal Cardwell
2013-08-28  7:37     ` Jason Wang
2013-08-28 10:34       ` Eric Dumazet
2013-08-30  3:02         ` Jason Wang
2013-08-29 19:51     ` David Miller
2013-08-29 20:26       ` Eric Dumazet
2013-08-29 20:35         ` David Miller
2013-08-29 21:26           ` Eric Dumazet
