All of lore.kernel.org
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Neal Cardwell <ncardwell@google.com>,
	Yuchung Cheng <ycheng@google.com>,
	Eric Dumazet <edumazet@google.com>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	"David S. Miller" <davem@davemloft.net>
Subject: [PATCH 4.14 26/52] tcp: when scheduling TLP, time of RTO should account for current ACK
Date: Fri, 15 Dec 2017 10:52:03 +0100	[thread overview]
Message-ID: <20171215092309.888125381@linuxfoundation.org> (raw)
In-Reply-To: <20171215092308.500651185@linuxfoundation.org>

4.14-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Neal Cardwell <ncardwell@google.com>


[ Upstream commit ed66dfaf236c04d414de1d218441296e57fb2bd2 ]

Fix the TLP scheduling logic so that when scheduling a TLP probe, we
ensure that the estimated time at which an RTO would fire accounts for
the fact that ACKs indicating forward progress should push back RTO
times.

After the following fix:

df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")

we had an unintentional behavior change in the following kind of
scenario: suppose the RTT variance has been very low recently. Then
suppose we send out a flight of N packets and our RTT is 100ms:

t=0: send a flight of N packets
t=100ms: receive an ACK for N-1 packets

The response before df92c8394e6e that was:
  -> schedule a TLP for now + RTO_interval

The response after df92c8394e6e is:
  -> schedule a TLP for t=0 + RTO_interval

Since RTO_interval = srtt + RTT_variance, this means that we have
scheduled a TLP timer at a point in the future that only accounts for
RTT_variance. If the RTT_variance term is small, this means that the
timer fires soon.

Before df92c8394e6e this would not happen, because in that code, when
we receive an ACK for a prefix of flight, we did:

    1) Near the top of tcp_ack(), switch from TLP timer to RTO
       at write_queue_head->paket_tx_time + RTO_interval:
            if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
                   tcp_rearm_rto(sk);

    2) In tcp_clean_rtx_queue(), update the RTO to now + RTO_interval:
            if (flag & FLAG_ACKED) {
                   tcp_rearm_rto(sk);

    3) In tcp_ack() after tcp_fastretrans_alert() switch from RTO
       to TLP at now + RTO_interval:
            if (icsk->icsk_pending == ICSK_TIME_RETRANS)
                   tcp_schedule_loss_probe(sk);

In df92c8394e6e we removed that 3-phase dance, and instead directly
set the TLP timer once: we set the TLP timer in cases like this to
write_queue_head->packet_tx_time + RTO_interval. So if the RTT
variance is small, then this means that this is setting the TLP timer
to fire quite soon. This means if the ACK for the tail of the flight
takes longer than an RTT to arrive (often due to delayed ACKs), then
the TLP timer fires too quickly.

Fixes: df92c8394e6e ("tcp: fix xmit timer to only be reset if data ACKed/SACKed")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 include/net/tcp.h     |    2 +-
 net/ipv4/tcp_input.c  |    2 +-
 net/ipv4/tcp_output.c |    8 +++++---
 3 files changed, 7 insertions(+), 5 deletions(-)

--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -563,7 +563,7 @@ void tcp_push_one(struct sock *, unsigne
 void tcp_send_ack(struct sock *sk);
 void tcp_send_delayed_ack(struct sock *sk);
 void tcp_send_loss_probe(struct sock *sk);
-bool tcp_schedule_loss_probe(struct sock *sk);
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto);
 void tcp_skb_collapse_tstamp(struct sk_buff *skb,
 			     const struct sk_buff *next_skb);
 
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3021,7 +3021,7 @@ void tcp_rearm_rto(struct sock *sk)
 /* Try to schedule a loss probe; if that doesn't work, then schedule an RTO. */
 static void tcp_set_xmit_timer(struct sock *sk)
 {
-	if (!tcp_schedule_loss_probe(sk))
+	if (!tcp_schedule_loss_probe(sk, true))
 		tcp_rearm_rto(sk);
 }
 
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2337,7 +2337,7 @@ repair:
 
 		/* Send one loss probe per tail loss episode. */
 		if (push_one != 2)
-			tcp_schedule_loss_probe(sk);
+			tcp_schedule_loss_probe(sk, false);
 		is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
 		tcp_cwnd_validate(sk, is_cwnd_limited);
 		return false;
@@ -2345,7 +2345,7 @@ repair:
 	return !tp->packets_out && tcp_send_head(sk);
 }
 
-bool tcp_schedule_loss_probe(struct sock *sk)
+bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -2384,7 +2384,9 @@ bool tcp_schedule_loss_probe(struct sock
 	}
 
 	/* If the RTO formula yields an earlier time, then use that time. */
-	rto_delta_us = tcp_rto_delta_us(sk);  /* How far in future is RTO? */
+	rto_delta_us = advancing_rto ?
+			jiffies_to_usecs(inet_csk(sk)->icsk_rto) :
+			tcp_rto_delta_us(sk);  /* How far in future is RTO? */
 	if (rto_delta_us > 0)
 		timeout = min_t(u32, timeout, usecs_to_jiffies(rto_delta_us));
 

  parent reply	other threads:[~2017-12-15 10:07 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-15  9:51 [PATCH 4.14 00/52] 4.14.7-stable review Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 02/52] net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 03/52] net: thunderx: Fix TCP/UDP checksum offload for IPv4 pkts Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 05/52] s390/qeth: fix early exit from error path Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 06/52] tipc: fix memory leak in tipc_accept_from_sock() Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 07/52] vhost: fix skb leak in handle_rx() Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 09/52] sit: update frag_off info Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 10/52] tcp: add tcp_v4_fill_cb()/tcp_v4_restore_cb() Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 11/52] packet: fix crash in fanout_demux_rollover() Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 12/52] net/packet: fix a race in packet_bind() and packet_notifier() Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 13/52] tcp: remove buggy call to tcp_v6_restore_cb() Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 16/52] stmmac: reset last TSO segment size after device open Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 18/52] s390/qeth: build max size GSO skbs on L2 devices Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 19/52] s390/qeth: fix thinko in IPv4 multicast address tracking Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 20/52] s390/qeth: fix GSO throughput regression Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 21/52] tcp: use IPCB instead of TCP_SKB_CB in inet_exact_dif_match() Greg Kroah-Hartman
2017-12-15  9:51 ` [PATCH 4.14 22/52] tipc: call tipc_rcv() only if bearer is up in tipc_udp_recv() Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 23/52] tcp: use current time in tcp_rcv_space_adjust() Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 24/52] net: sched: cbq: create block for q->link.block Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 25/52] tap: free skb if flags error Greg Kroah-Hartman
2017-12-15  9:52 ` Greg Kroah-Hartman [this message]
2017-12-15  9:52 ` [PATCH 4.14 27/52] tun: free skb in early errors Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 28/52] net: ipv6: Fixup device for anycast routes during copy Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 29/52] tun: fix rcu_read_lock imbalance in tun_build_skb Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 30/52] net: accept UFO datagrams from tuntap and packet Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 31/52] net: openvswitch: datapath: fix data type in queue_gso_packets Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 32/52] cls_bpf: dont decrement nets refcount when offload fails Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 33/52] sctp: use right member as the param of list_for_each_entry Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 34/52] ipmi: Stop timers before cleaning up the module Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 35/52] usb: gadget: ffs: Forbid usb_ep_alloc_request from sleeping Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 36/52] fcntl: dont cap l_start and l_end values for F_GETLK64 in compat syscall Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 37/52] fix kcm_clone() Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 38/52] KVM: arm/arm64: vgic-its: Preserve the revious read from the pending table Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 39/52] kbuild: do not call cc-option before KBUILD_CFLAGS initialization Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 40/52] powerpc/powernv/idle: Round up latency and residency values Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 41/52] ipvlan: fix ipv6 outbound device Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 42/52] ide: ide-atapi: fix compile error with defining macro DEBUG Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 43/52] blk-mq: Avoid that request queue removal can trigger list corruption Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 44/52] nvmet-rdma: update queue list during ib_device removal Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 45/52] audit: Allow auditd to set pid to 0 to end auditing Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 46/52] audit: ensure that audit=1 actually enables audit for PID 1 Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 47/52] dm raid: fix panic when attempting to force a raid to sync Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 48/52] md: free unused memory after bitmap resize Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 49/52] RDMA/cxgb4: Annotate r2 and stag as __be32 Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 50/52] x86/intel_rdt: Fix potential deadlock during resctrl unmount Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 51/52] media: dvb-core: always call invoke_release() in fe_free() Greg Kroah-Hartman
2017-12-15  9:52 ` [PATCH 4.14 52/52] dvb_frontend: dont use-after-free the frontend struct Greg Kroah-Hartman
2017-12-15 10:09 ` [PATCH 4.14 00/52] 4.14.7-stable review Nikola Ciprich
2017-12-15 10:09   ` Nikola Ciprich
2017-12-15 13:07   ` Greg Kroah-Hartman
2017-12-15 17:41 ` Guenter Roeck
2017-12-15 18:27   ` Greg Kroah-Hartman
2017-12-15 21:12 ` Shuah Khan
2017-12-15 21:32   ` Greg Kroah-Hartman
2017-12-16  5:28 ` Naresh Kamboju
2017-12-16  8:23   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171215092309.888125381@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ncardwell@google.com \
    --cc=soheil@google.com \
    --cc=stable@vger.kernel.org \
    --cc=ycheng@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.