From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jason Wang <jasowang@redhat.com>
Subject: Re: [PATCH v3 net-next] tcp: TSO packets automatic sizing
Date: Wed, 28 Aug 2013 15:37:34 +0800
Message-ID: <521DA8BE.3060709@redhat.com>
References: <1377304192.8828.43.camel@edumazet-glaptop>  <1377491162.8828.116.camel@edumazet-glaptop> <1377607592.8828.149.camel@edumazet-glaptop>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Cc: David Miller <davem@davemloft.net>,
	netdev <netdev@vger.kernel.org>,
	Neal Cardwell <ncardwell@google.com>,
	Yuchung Cheng <ycheng@google.com>,
	Van Jacobson <vanj@google.com>,
	Tom Herbert <therbert@google.com>,
	"Michael S. Tsirkin" <mst@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:49758 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754433Ab3H1Hho (ORCPT <rfc822;netdev@vger.kernel.org>);
	Wed, 28 Aug 2013 03:37:44 -0400
In-Reply-To: <1377607592.8828.149.camel@edumazet-glaptop>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 08/27/2013 08:46 PM, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> After hearing many people over past years complaining against TSO being
> bursty or even buggy, we are proud to present automatic sizing of TSO
> packets.
>
> One part of the problem is that tcp_tso_should_defer() uses an heuristic
> relying on upcoming ACKS instead of a timer, but more generally, having
> big TSO packets makes little sense for low rates, as it tends to create
> micro bursts on the network, and general consensus is to reduce the
> buffering amount.
>
> This patch introduces a per socket sk_pacing_rate, that approximates
> the current sending rate, and allows us to size the TSO packets so
> that we try to send one packet every ms.
>
> This field could be set by other transports.
>
> Patch has no impact for high speed flows, where having large TSO packets
> makes sense to reach line rate.
>
> For other flows, this helps better packet scheduling and ACK clocking.
>
> This patch increases performance of TCP flows in lossy environments.
>
> A new sysctl (tcp_min_tso_segs) is added, to specify the
> minimal size of a TSO packet (default being 2).
>
> A follow-up patch will provide a new packet scheduler (FQ), using
> sk_pacing_rate as an input to perform optional per flow pacing.
>
> This explains why we chose to set sk_pacing_rate to twice the current
> rate, allowing 'slow start' ramp up.
>
> sk_pacing_rate = 2 * cwnd * mss / srtt
>  
> v2: Neal Cardwell reported a suspect deferring of last two segments on
> initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
> into account tp->xmit_size_goal_segs 
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Van Jacobson <vanj@google.com>
> Cc: Tom Herbert <therbert@google.com>
> ---
> v3: The change Yuchung suggested added a possibility of a divide by 0:
>     On some (retransmits) case, srtt can be 0 because
>     tcp_rtt_estimator() has not yet been called.
>     Change the computation to remove this, and do not yet use usec
>     as the units, but HZ. [ Its interesting to see jiffies_to_usecs()
>     being an out of line function :( ]
>
> This version passed all our tests.
>
>  Documentation/networking/ip-sysctl.txt |    9 ++++++
>  include/net/sock.h                     |    2 +
>  include/net/tcp.h                      |    1 
>  net/ipv4/sysctl_net_ipv4.c             |   10 +++++++
>  net/ipv4/tcp.c                         |   28 ++++++++++++++++----
>  net/ipv4/tcp_input.c                   |   32 ++++++++++++++++++++++-
>  net/ipv4/tcp_output.c                  |    2 -
>  7 files changed, 77 insertions(+), 7 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index debfe85..ce5bb43 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -482,6 +482,15 @@ tcp_syn_retries - INTEGER
>  tcp_timestamps - BOOLEAN
>  	Enable timestamps as defined in RFC1323.
>  
> +tcp_min_tso_segs - INTEGER
> +	Minimal number of segments per TSO frame.
> +	Since linux-3.12, TCP does an automatic sizing of TSO frames,
> +	depending on flow rate, instead of filling 64Kbytes packets.
> +	For specific usages, it's possible to force TCP to build big
> +	TSO frames. Note that TCP stack might split too big TSO packets
> +	if available window is too small.
> +	Default: 2
> +
>  tcp_tso_win_divisor - INTEGER
>  	This allows control over what percentage of the congestion window
>  	can be consumed by a single TSO frame.
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e4bbcbf..6ba2e7b 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -232,6 +232,7 @@ struct cg_proto;
>    *	@sk_napi_id: id of the last napi context to receive data for sk
>    *	@sk_ll_usec: usecs to busypoll when there is no data
>    *	@sk_allocation: allocation mode
> +  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
>    *	@sk_sndbuf: size of send buffer in bytes
>    *	@sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
>    *		   %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
> @@ -361,6 +362,7 @@ struct sock {
>  	kmemcheck_bitfield_end(flags);
>  	int			sk_wmem_queued;
>  	gfp_t			sk_allocation;
> +	u32			sk_pacing_rate; /* bytes per second */
>  	netdev_features_t	sk_route_caps;
>  	netdev_features_t	sk_route_nocaps;
>  	int			sk_gso_type;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 09cb5c1..73fcd7c 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -281,6 +281,7 @@ extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
>  extern unsigned int sysctl_tcp_notsent_lowat;
> +extern int sysctl_tcp_min_tso_segs;
>  
>  extern atomic_long_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 8ed7c32..540279f 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -29,6 +29,7 @@
>  static int zero;
>  static int one = 1;
>  static int four = 4;
> +static int gso_max_segs = GSO_MAX_SEGS;
>  static int tcp_retr1_max = 255;
>  static int ip_local_port_range_min[] = { 1, 1 };
>  static int ip_local_port_range_max[] = { 65535, 65535 };
> @@ -761,6 +762,15 @@ static struct ctl_table ipv4_table[] = {
>  		.extra2		= &four,
>  	},
>  	{
> +		.procname	= "tcp_min_tso_segs",
> +		.data		= &sysctl_tcp_min_tso_segs,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
> +		.extra1		= &zero,
> +		.extra2		= &gso_max_segs,
> +	},
> +	{
>  		.procname	= "udp_mem",
>  		.data		= &sysctl_udp_mem,
>  		.maxlen		= sizeof(sysctl_udp_mem),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 4e42c03..fdf7409 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -283,6 +283,8 @@
>  
>  int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
>  
> +int sysctl_tcp_min_tso_segs __read_mostly = 2;
> +
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>  
> @@ -785,12 +787,28 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
>  	xmit_size_goal = mss_now;
>  
>  	if (large_allowed && sk_can_gso(sk)) {
> -		xmit_size_goal = ((sk->sk_gso_max_size - 1) -
> -				  inet_csk(sk)->icsk_af_ops->net_header_len -
> -				  inet_csk(sk)->icsk_ext_hdr_len -
> -				  tp->tcp_header_len);
> +		u32 gso_size, hlen;
> +
> +		/* Maybe we should/could use sk->sk_prot->max_header here ? */
> +		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
> +		       inet_csk(sk)->icsk_ext_hdr_len +
> +		       tp->tcp_header_len;
> +
> +		/* Goal is to send at least one packet per ms,
> +		 * not one big TSO packet every 100 ms.
> +		 * This preserves ACK clocking and is consistent
> +		 * with tcp_tso_should_defer() heuristic.
> +		 */
> +		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
> +		gso_size = max_t(u32, gso_size,
> +				 sysctl_tcp_min_tso_segs * mss_now);
> +
> +		xmit_size_goal = min_t(u32, gso_size,
> +				       sk->sk_gso_max_size - 1 - hlen);
>  
> -		/* TSQ : try to have two TSO segments in flight */
> +		/* TSQ : try to have at least two segments in flight
> +		 * (one in NIC TX ring, another in Qdisc)
> +		 */
>  		xmit_size_goal = min_t(u32, xmit_size_goal,
>  				       sysctl_tcp_limit_output_bytes >> 1);
>  
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ec492ea..436c7e8 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -688,6 +688,34 @@ static void tcp_rtt_estimator(struct sock *sk, const __u32 mrtt)
>  	}
>  }
>  
> +/* Set the sk_pacing_rate to allow proper sizing of TSO packets.
> + * Note: TCP stack does not yet implement pacing.
> + * FQ packet scheduler can be used to implement cheap but effective
> + * TCP pacing, to smooth the burst on large writes when packets
> + * in flight is significantly lower than cwnd (or rwin)
> + */
> +static void tcp_update_pacing_rate(struct sock *sk)
> +{
> +	const struct tcp_sock *tp = tcp_sk(sk);
> +	u64 rate;
> +
> +	/* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
> +	rate = (u64)tp->mss_cache * 2 * (HZ << 3);
> +
> +	rate *= max(tp->snd_cwnd, tp->packets_out);
> +
> +	/* Correction for small srtt : minimum srtt being 8 (1 jiffy << 3),
> +	 * be conservative and assume srtt = 1 (125 us instead of 1.25 ms)
> +	 * We probably need usec resolution in the future.
> +	 * Note: This also takes care of possible srtt=0 case,
> +	 * when tcp_rtt_estimator() was not yet called.
> +	 */
> +	if (tp->srtt > 8 + 2)
> +		do_div(rate, tp->srtt);
> +
> +	sk->sk_pacing_rate = min_t(u64, rate, ~0U);
> +}
> +
>  /* Calculate rto without backoff.  This is the second half of Van Jacobson's
>   * routine referred to above.
>   */
> @@ -3278,7 +3306,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  	u32 ack_seq = TCP_SKB_CB(skb)->seq;
>  	u32 ack = TCP_SKB_CB(skb)->ack_seq;
>  	bool is_dupack = false;
> -	u32 prior_in_flight;
> +	u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
>  	u32 prior_fackets;
>  	int prior_packets = tp->packets_out;
>  	const int prior_unsacked = tp->packets_out - tp->sacked_out;
> @@ -3383,6 +3411,8 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  
>  	if (icsk->icsk_pending == ICSK_TIME_RETRANS)
>  		tcp_schedule_loss_probe(sk);
> +	if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
> +		tcp_update_pacing_rate(sk);
>  	return 1;
>  
>  no_queue:
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 884efff..e63ae4c 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1631,7 +1631,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
>  
>  	/* If a full-sized TSO skb can be sent, do it. */
>  	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> -			   sk->sk_gso_max_segs * tp->mss_cache))
> +			   tp->xmit_size_goal_segs * tp->mss_cache))
>  		goto send_now;
A question is: Does this really guarantee the minimal TSO segments
excluding the case of small available window? The skb->len may be much
smaller and can still be sent here. Maybe we should check skb->len also?

>  
>  	/* Middle in queue won't get any more data, full sendable already? */
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html