From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
stable@vger.kernel.org, Eric Dumazet <edumazet@google.com>,
Neal Cardwell <ncardwell@google.com>,
Yuchung Cheng <ycheng@google.com>, Van Jacobson <vanj@google.com>,
Tom Herbert <therbert@google.com>,
"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 3.10 01/54] tcp: TSO packets automatic sizing
Date: Fri, 1 Nov 2013 15:03:29 -0700 [thread overview]
Message-ID: <20131101220211.489101803@linuxfoundation.org> (raw)
In-Reply-To: <20131101220211.311926234@linuxfoundation.org>
3.10-stable review patch. If anyone has any objections, please let me know.
------------------
From: Eric Dumazet <edumazet@google.com>
[ Upstream commits 6d36824e730f247b602c90e8715a792003e3c5a7,
02cf4ebd82ff0ac7254b88e466820a290ed8289a, and parts of
7eec4174ff29cd42f2acfae8112f51c228545d40 ]
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.
One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.
This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.
This field could be set by other transports.
Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.
For other flows, this helps better packet scheduling and ACK clocking.
This patch increases performance of TCP flows in lossy environments.
A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).
A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.
This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.
sk_pacing_rate = 2 * cwnd * mss / srtt
v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
Documentation/networking/ip-sysctl.txt | 9 ++++++++
include/net/sock.h | 2 +
include/net/tcp.h | 1
net/core/sock.c | 1
net/ipv4/sysctl_net_ipv4.c | 10 +++++++++
net/ipv4/tcp.c | 28 ++++++++++++++++++++++-----
net/ipv4/tcp_input.c | 34 ++++++++++++++++++++++++++++++++-
net/ipv4/tcp_output.c | 2 -
8 files changed, 80 insertions(+), 7 deletions(-)
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -478,6 +478,15 @@ tcp_syn_retries - INTEGER
tcp_timestamps - BOOLEAN
Enable timestamps as defined in RFC1323.
+tcp_min_tso_segs - INTEGER
+ Minimal number of segments per TSO frame.
+ Since linux-3.12, TCP does an automatic sizing of TSO frames,
+ depending on flow rate, instead of filling 64Kbytes packets.
+ For specific usages, it's possible to force TCP to build big
+ TSO frames. Note that TCP stack might split too big TSO packets
+ if available window is too small.
+ Default: 2
+
tcp_tso_win_divisor - INTEGER
This allows control over what percentage of the congestion window
can be consumed by a single TSO frame.
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -230,6 +230,7 @@ struct cg_proto;
* @sk_wmem_queued: persistent queue size
* @sk_forward_alloc: space allocated forward
* @sk_allocation: allocation mode
+ * @sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
* @sk_sndbuf: size of send buffer in bytes
* @sk_flags: %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -355,6 +356,7 @@ struct sock {
kmemcheck_bitfield_end(flags);
int sk_wmem_queued;
gfp_t sk_allocation;
+ u32 sk_pacing_rate; /* bytes per second */
netdev_features_t sk_route_caps;
netdev_features_t sk_route_nocaps;
int sk_gso_type;
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -287,6 +287,7 @@ extern int sysctl_tcp_thin_dupack;
extern int sysctl_tcp_early_retrans;
extern int sysctl_tcp_limit_output_bytes;
extern int sysctl_tcp_challenge_ack_limit;
+extern int sysctl_tcp_min_tso_segs;
extern atomic_long_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2271,6 +2271,7 @@ void sock_init_data(struct socket *sock,
sk->sk_stamp = ktime_set(-1L, 0);
+ sk->sk_pacing_rate = ~0U;
/*
* Before updating sk_refcnt, we must commit prior changes to memory
* (Documentation/RCU/rculist_nulls.txt for details)
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -29,6 +29,7 @@
static int zero;
static int one = 1;
static int four = 4;
+static int gso_max_segs = GSO_MAX_SEGS;
static int tcp_retr1_max = 255;
static int ip_local_port_range_min[] = { 1, 1 };
static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -753,6 +754,15 @@ static struct ctl_table ipv4_table[] = {
.extra2 = &four,
},
{
+ .procname = "tcp_min_tso_segs",
+ .data = &sysctl_tcp_min_tso_segs,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &gso_max_segs,
+ },
+ {
.procname = "udp_mem",
.data = &sysctl_udp_mem,
.maxlen = sizeof(sysctl_udp_mem),
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -282,6 +282,8 @@
int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
+int sysctl_tcp_min_tso_segs __read_mostly = 2;
+
struct percpu_counter tcp_orphan_count;
EXPORT_SYMBOL_GPL(tcp_orphan_count);
@@ -786,12 +788,28 @@ static unsigned int tcp_xmit_size_goal(s
xmit_size_goal = mss_now;
if (large_allowed && sk_can_gso(sk)) {
- xmit_size_goal = ((sk->sk_gso_max_size - 1) -
- inet_csk(sk)->icsk_af_ops->net_header_len -
- inet_csk(sk)->icsk_ext_hdr_len -
- tp->tcp_header_len);
+ u32 gso_size, hlen;
- /* TSQ : try to have two TSO segments in flight */
+ /* Maybe we should/could use sk->sk_prot->max_header here ? */
+ hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+ inet_csk(sk)->icsk_ext_hdr_len +
+ tp->tcp_header_len;
+
+ /* Goal is to send at least one packet per ms,
+ * not one big TSO packet every 100 ms.
+ * This preserves ACK clocking and is consistent
+ * with tcp_tso_should_defer() heuristic.
+ */
+ gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
+ gso_size = max_t(u32, gso_size,
+ sysctl_tcp_min_tso_segs * mss_now);
+
+ xmit_size_goal = min_t(u32, gso_size,
+ sk->sk_gso_max_size - 1 - hlen);
+
+ /* TSQ : try to have at least two segments in flight
+ * (one in NIC TX ring, another in Qdisc)
+ */
xmit_size_goal = min_t(u32, xmit_size_goal,
sysctl_tcp_limit_output_bytes >> 1);
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -699,6 +699,34 @@ static void tcp_rtt_estimator(struct soc
}
}
+/* Set the sk_pacing_rate to allow proper sizing of TSO packets.
+ * Note: TCP stack does not yet implement pacing.
+ * FQ packet scheduler can be used to implement cheap but effective
+ * TCP pacing, to smooth the burst on large writes when packets
+ * in flight is significantly lower than cwnd (or rwin)
+ */
+static void tcp_update_pacing_rate(struct sock *sk)
+{
+ const struct tcp_sock *tp = tcp_sk(sk);
+ u64 rate;
+
+ /* set sk_pacing_rate to 200 % of current rate (mss * cwnd / srtt) */
+ rate = (u64)tp->mss_cache * 2 * (HZ << 3);
+
+ rate *= max(tp->snd_cwnd, tp->packets_out);
+
+ /* Correction for small srtt : minimum srtt being 8 (1 jiffy << 3),
+ * be conservative and assume srtt = 1 (125 us instead of 1.25 ms)
+ * We probably need usec resolution in the future.
+ * Note: This also takes care of possible srtt=0 case,
+ * when tcp_rtt_estimator() was not yet called.
+ */
+ if (tp->srtt > 8 + 2)
+ do_div(rate, tp->srtt);
+
+ sk->sk_pacing_rate = min_t(u64, rate, ~0U);
+}
+
/* Calculate rto without backoff. This is the second half of Van Jacobson's
* routine referred to above.
*/
@@ -3330,7 +3358,7 @@ static int tcp_ack(struct sock *sk, cons
u32 ack_seq = TCP_SKB_CB(skb)->seq;
u32 ack = TCP_SKB_CB(skb)->ack_seq;
bool is_dupack = false;
- u32 prior_in_flight;
+ u32 prior_in_flight, prior_cwnd = tp->snd_cwnd, prior_rtt = tp->srtt;
u32 prior_fackets;
int prior_packets = tp->packets_out;
int prior_sacked = tp->sacked_out;
@@ -3438,6 +3466,8 @@ static int tcp_ack(struct sock *sk, cons
if (icsk->icsk_pending == ICSK_TIME_RETRANS)
tcp_schedule_loss_probe(sk);
+ if (tp->srtt != prior_rtt || tp->snd_cwnd != prior_cwnd)
+ tcp_update_pacing_rate(sk);
return 1;
no_queue:
@@ -5736,6 +5766,8 @@ int tcp_rcv_state_process(struct sock *s
} else
tcp_init_metrics(sk);
+ tcp_update_pacing_rate(sk);
+
/* Prevent spurious tcp_cwnd_restart() on
* first data packet.
*/
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1623,7 +1623,7 @@ static bool tcp_tso_should_defer(struct
/* If a full-sized TSO skb can be sent, do it. */
if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
- sk->sk_gso_max_segs * tp->mss_cache))
+ tp->xmit_size_goal_segs * tp->mss_cache))
goto send_now;
/* Middle in queue won't get any more data, full sendable already? */
next prev parent reply other threads:[~2013-11-01 22:04 UTC|newest]
Thread overview: 57+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-01 22:03 [PATCH 3.10 00/54] 3.10.18-stable review Greg Kroah-Hartman
2013-11-01 22:03 ` Greg Kroah-Hartman [this message]
2013-11-01 22:03 ` [PATCH 3.10 02/54] tcp: TSQ can use a dynamic limit Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 03/54] tcp: must unclone packets before mangling them Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 04/54] tcp: do not forget FIN in tcp_shifted_skb() Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 05/54] tcp: fix incorrect ca_state in tail loss probe Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 06/54] net: do not call sock_put() on TIMEWAIT sockets Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 07/54] l2tp: fix kernel panic when using IPv4-mapped IPv6 addresses Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 08/54] l2tp: Fix build warning with ipv6 disabled Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 09/54] net: mv643xx_eth: update statistics timer from timer context only Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 10/54] net: mv643xx_eth: fix orphaned statistics timer crash Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 11/54] net: heap overflow in __audit_sockaddr() Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 12/54] proc connector: fix info leaks Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 13/54] ipv4: fix ineffective source address selection Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 14/54] can: dev: fix nlmsg size calculation in can_get_size() Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 15/54] net: secure_seq: Fix warning when CONFIG_IPV6 and CONFIG_INET are not selected Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 16/54] xen-netback: Dont destroy the netdev until the vif is shut down Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 17/54] net: vlan: fix nlmsg size calculation in vlan_get_size() Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 18/54] vti: get rid of nf mark rule in prerouting Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 19/54] l2tp: must disable bh before calling l2tp_xmit_skb() Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 20/54] farsync: fix info leak in ioctl Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 21/54] unix_diag: fix info leak Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 22/54] connector: use nlmsg_len() to check message length Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 23/54] bnx2x: record rx queue for LRO packets Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 24/54] virtio-net: dont respond to cpu hotplug notifier if were not ready Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 25/54] virtio-net: fix the race between channels setting and refill Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 26/54] virtio-net: refill only when device is up during setting queues Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 27/54] bridge: Correctly clamp MAX forward_delay when enabling STP Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 28/54] net: dst: provide accessor function to dst->xfrm Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 29/54] sctp: Use software crc32 checksum when xfrm transform will happen Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 30/54] sctp: Perform software checksum if packet has to be fragmented Greg Kroah-Hartman
2013-11-01 22:03 ` [PATCH 3.10 31/54] wanxl: fix info leak in ioctl Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 32/54] be2net: pass if_id for v1 and V2 versions of TX_CREATE cmd Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 33/54] net: unix: inherit SOCK_PASS{CRED, SEC} flags from socket to fix race Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 34/54] net: fix cipso packet validation when !NETLABEL Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 35/54] inet: fix possible memory corruption with UDP_CORK and UFO Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 36/54] ipv6: always prefer rt6i_gateway if present Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 37/54] ipv6: fill rt6i_gateway with nexthop address Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 38/54] netfilter: nf_conntrack: fix rt6i_gateway checks for H.323 helper Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 39/54] ipv6: probe routes asynchronous in rt6_probe Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 40/54] davinci_emac.c: Fix IFF_ALLMULTI setup Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 41/54] ARM: 7851/1: check for number of arguments in syscall_get/set_arguments() Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 42/54] ARM: integrator: deactivate timer0 on the Integrator/CP Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 43/54] gpio/lynxpoint: check if the interrupt is enabled in IRQ handler Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 44/54] dm snapshot: fix data corruption Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 45/54] i2c: ismt: initialize DMA buffer Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 46/54] mm: fix BUG in __split_huge_page_pmd Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 47/54] ALSA: us122l: Fix pcm_usb_stream mmapping regression Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 48/54] ALSA: hda - Fix inverted internal mic not indicated on some machines Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 49/54] writeback: fix negative bdi max pause Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 50/54] wireless: radiotap: fix parsing buffer overrun Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 51/54] serial: vt8500: add missing braces Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 52/54] USB: serial: ti_usb_3410_5052: add Abbott strip port ID to combined table as well Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 53/54] USB: serial: option: add support for Inovia SEW858 device Greg Kroah-Hartman
2013-11-01 22:04 ` [PATCH 3.10 54/54] usb: serial: option: blacklist Olivetti Olicard200 Greg Kroah-Hartman
2013-11-02 2:30 ` [PATCH 3.10 00/54] 3.10.18-stable review Guenter Roeck
2013-11-02 21:32 ` Shuah Khan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131101220211.489101803@linuxfoundation.org \
--to=gregkh@linuxfoundation.org \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=ncardwell@google.com \
--cc=stable@vger.kernel.org \
--cc=therbert@google.com \
--cc=vanj@google.com \
--cc=ycheng@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox