[PATCH v4 1/3] tcp: early retransmit: tcp_enter

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery()
@ 2012-05-02 23:30 Yuchung Cheng
  2012-05-02 23:30 ` [PATCH v4 2/3] tcp: early retransmit Yuchung Cheng
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Yuchung Cheng @ 2012-05-02 23:30 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng

This a prepartion patch that refactors the code to enter recovery
into a new function tcp_enter_recovery(). It's needed to implement
the delayed fast retransmit in ER.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog since v1:
 - swaped with part 1 and part2
ChangeLog since v2:
  - removed RFC in commit message
ChangeLog since v3:
  - nothing

 net/ipv4/tcp_input.c |   61 +++++++++++++++++++++++++++----------------------
 1 files changed, 34 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c93b0cb..22df826 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3022,6 +3022,38 @@ static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
 	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
 }
 
+static void tcp_enter_recovery(struct sock *sk, bool ece_ack)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int mib_idx;
+
+	if (tcp_is_reno(tp))
+		mib_idx = LINUX_MIB_TCPRENORECOVERY;
+	else
+		mib_idx = LINUX_MIB_TCPSACKRECOVERY;
+
+	NET_INC_STATS_BH(sock_net(sk), mib_idx);
+
+	tp->high_seq = tp->snd_nxt;
+	tp->prior_ssthresh = 0;
+	tp->undo_marker = tp->snd_una;
+	tp->undo_retrans = tp->retrans_out;
+
+	if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) {
+		if (!ece_ack)
+			tp->prior_ssthresh = tcp_current_ssthresh(sk);
+		tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk);
+		TCP_ECN_queue_cwr(tp);
+	}
+
+	tp->bytes_acked = 0;
+	tp->snd_cwnd_cnt = 0;
+	tp->prior_cwnd = tp->snd_cwnd;
+	tp->prr_delivered = 0;
+	tp->prr_out = 0;
+	tcp_set_ca_state(sk, TCP_CA_Recovery);
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -3041,7 +3073,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 	struct tcp_sock *tp = tcp_sk(sk);
 	int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
 				    (tcp_fackets_out(tp) > tp->reordering));
-	int fast_rexmit = 0, mib_idx;
+	int fast_rexmit = 0;
 
 	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
@@ -3142,32 +3174,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		}
 
 		/* Otherwise enter Recovery state */
-
-		if (tcp_is_reno(tp))
-			mib_idx = LINUX_MIB_TCPRENORECOVERY;
-		else
-			mib_idx = LINUX_MIB_TCPSACKRECOVERY;
-
-		NET_INC_STATS_BH(sock_net(sk), mib_idx);
-
-		tp->high_seq = tp->snd_nxt;
-		tp->prior_ssthresh = 0;
-		tp->undo_marker = tp->snd_una;
-		tp->undo_retrans = tp->retrans_out;
-
-		if (icsk->icsk_ca_state < TCP_CA_CWR) {
-			if (!(flag & FLAG_ECE))
-				tp->prior_ssthresh = tcp_current_ssthresh(sk);
-			tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
-			TCP_ECN_queue_cwr(tp);
-		}
-
-		tp->bytes_acked = 0;
-		tp->snd_cwnd_cnt = 0;
-		tp->prior_cwnd = tp->snd_cwnd;
-		tp->prr_delivered = 0;
-		tp->prr_out = 0;
-		tcp_set_ca_state(sk, TCP_CA_Recovery);
+		tcp_enter_recovery(sk, (flag & FLAG_ECE));
 		fast_rexmit = 1;
 	}
 
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 2/3] tcp: early retransmit
  2012-05-02 23:30 [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
@ 2012-05-02 23:30 ` Yuchung Cheng
  2012-05-03  0:14   ` Neal Cardwell
  2012-05-03  0:56   ` David Miller
  2012-05-02 23:30 ` [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: Yuchung Cheng @ 2012-05-02 23:30 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng

This patch implements RFC 5827 early retransmit (ER) for TCP.
It reduces DUPACK threshold (dupthresh) if outstanding packets are
less than 4 to recover losses by fast recovery instead of timeout.

While the algorithm is simple, small but frequent network reordering
makes this feature dangerous: the connection repeatedly enter
false recovery and degrade performance. Therefore we implement
a mitigation suggested in the appendix of the RFC that delays
entering fast recovery by a small interval, i.e., RTT/4. Currently
ER is conservative and is disabled for the rest of the connection
after the first reordering event. A large scale web server
experiment on the performance impact of ER is summarized in
section 6 of the paper "Proportional Rate Reduction for TCP”,
IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf

Note that Linux has a similar feature called THIN_DUPACK. The
differences are THIN_DUPACK do not mitigate reorderings and is only
used after slow start. Currently ER is disabled if THIN_DUPACK is
enabled. I would be happy to merge THIN_DUPACK feature with ER if
people think it's a good idea.

ER is enabled by sysctl_tcp_early_retrans:
  0: Disables ER

  1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.

  2: (Default) reduce dupthresh like mode 1. In addition, delay
     entering fast recovery by RTT/4.

Note: mode 2 is implemented in the third part of this patch series.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
 - swapped part1 and part2
ChangeLog in v3:
 - trivial text fixes in documentation.
ChangeLog in v4:
 - disables ER if any reordering. shuffle early retransmit vars in tcp_sock.

 Documentation/networking/ip-sysctl.txt |   14 ++++++++++++++
 include/linux/tcp.h                    |    1 +
 include/net/tcp.h                      |   15 +++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |   10 ++++++++++
 net/ipv4/tcp.c                         |    3 +++
 net/ipv4/tcp_input.c                   |   15 +++++++++++++++
 net/ipv4/tcp_minisocks.c               |    1 +
 7 files changed, 59 insertions(+), 0 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 9b569a2..34916e7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -190,6 +190,20 @@ tcp_cookie_size - INTEGER
 tcp_dsack - BOOLEAN
 	Allows TCP to send "duplicate" SACKs.
 
+tcp_early_retrans - INTEGER
+	Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
+	for triggering fast retransmit when the amount of outstanding data is
+	small and when no previously unsent data can be transmitted (such
+	that limited transmit could be used).
+	Possible values:
+		0 disables ER
+		1 enables ER
+		2 enables ER but delays fast recovery and fast retransmit
+		  by a fourth of RTT. This mitigates connection falsely
+		  recovers when network has a small degree of reordering
+		  (less than 3 packets).
+	Default: 2
+
 tcp_ecn - INTEGER
 	Enable Explicit Congestion Notification (ECN) in TCP. ECN is only
 	used when both ends of the TCP flow support it. It is useful to
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 278af9e..7859b41 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -372,6 +372,7 @@ struct tcp_sock {
 		repair      : 1,
 		unused      : 1;
 	u8	repair_queue;
+	u8	do_early_retrans:1;/* Enable RFC5827 early-retransmit  */
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0fb84de..685437a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -252,6 +252,7 @@ extern int sysctl_tcp_max_ssthresh;
 extern int sysctl_tcp_cookie_size;
 extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
+extern int sysctl_tcp_early_retrans;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
@@ -797,6 +798,20 @@ static inline void tcp_enable_fack(struct tcp_sock *tp)
 	tp->rx_opt.sack_ok |= TCP_FACK_ENABLED;
 }
 
+/* TCP early-retransmit (ER) is similar to but more conservative than
+ * the thin-dupack feature.  Enable ER only if thin-dupack is disabled.
+ */
+static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
+{
+	tp->do_early_retrans = sysctl_tcp_early_retrans &&
+		!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+}
+
+static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
+{
+	tp->do_early_retrans = 0;
+}
+
 static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
 {
 	return tp->sacked_out + tp->lost_out;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 33417f8..ef32956 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -27,6 +27,7 @@
 #include <net/tcp_memcontrol.h>
 
 static int zero;
+static int two = 2;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
 static int ip_local_port_range_max[] = { 65535, 65535 };
@@ -677,6 +678,15 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler   = proc_dointvec
 	},
 	{
+		.procname	= "tcp_early_retrans",
+		.data		= &sysctl_tcp_early_retrans,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &two,
+	},
+	{
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
 		.maxlen		= sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9670af3..6802c89 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -395,6 +395,7 @@ void tcp_init_sock(struct sock *sk)
 	tp->mss_cache = TCP_MSS_DEFAULT;
 
 	tp->reordering = sysctl_tcp_reordering;
+	tcp_enable_early_retrans(tp);
 	icsk->icsk_ca_ops = &tcp_init_congestion_ops;
 
 	sk->sk_state = TCP_CLOSE;
@@ -2495,6 +2496,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EINVAL;
 		else
 			tp->thin_dupack = val;
+			if (tp->thin_dupack)
+				tcp_disable_early_retrans(tp);
 		break;
 
 	case TCP_REPAIR:
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 22df826..82194c7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -99,6 +99,7 @@ int sysctl_tcp_thin_dupack __read_mostly;
 
 int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
 int sysctl_tcp_abc __read_mostly;
+int sysctl_tcp_early_retrans __read_mostly = 2;
 
 #define FLAG_DATA		0x01 /* Incoming frame contained data.		*/
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update.	*/
@@ -906,6 +907,7 @@ static void tcp_init_metrics(struct sock *sk)
 	if (dst_metric(dst, RTAX_REORDERING) &&
 	    tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
 		tcp_disable_fack(tp);
+		tcp_disable_early_retrans(tp);
 		tp->reordering = dst_metric(dst, RTAX_REORDERING);
 	}
 
@@ -988,6 +990,9 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
 #endif
 		tcp_disable_fack(tp);
 	}
+
+	if (metric > 0)
+		tcp_disable_early_retrans(tp);
 }
 
 /* This must be called before lost_out is incremented */
@@ -2492,6 +2497,16 @@ static int tcp_time_to_recover(struct sock *sk)
 	    tcp_is_sack(tp) && !tcp_send_head(sk))
 		return 1;
 
+	/* Trick#6: TCP early retransmit, per RFC5827.  To avoid spurious
+	 * retransmissions due to small network reorderings, we implement
+	 * Mitigation A.3 in the RFC and delay the retransmission for a short
+	 * interval if appropriate.
+	 */
+	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
+	    (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
+	    !tcp_may_send_now(sk))
+		return 1;
+
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 3cabafb..6f6a918 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -482,6 +482,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->sacked_out = 0;
 		newtp->fackets_out = 0;
 		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
+		tcp_enable_early_retrans(newtp);
 
 		/* So many TCP implementations out there (incorrectly) count the
 		 * initial SYN frame in their delayed-ACK and congestion control
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit
  2012-05-02 23:30 [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
  2012-05-02 23:30 ` [PATCH v4 2/3] tcp: early retransmit Yuchung Cheng
@ 2012-05-02 23:30 ` Yuchung Cheng
  2012-05-03  0:15   ` Neal Cardwell
  2012-05-03  0:56   ` David Miller
  2012-05-03  0:13 ` [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Neal Cardwell
  2012-05-03  0:56 ` David Miller
  3 siblings, 2 replies; 10+ messages in thread
From: Yuchung Cheng @ 2012-05-02 23:30 UTC (permalink / raw)
  To: davem, ilpo.jarvinen; +Cc: ncardwell, nanditad, netdev, Yuchung Cheng

Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
Delays the fast retransmit by an interval of RTT/4. We borrow the
RTO timer to implement the delay. If we receive another ACK or send
a new packet, the timer is cancelled and restored to original RTO
value offset by time elapsed.  When the delayed-ER timer fires,
we enter fast recovery and perform fast retransmit.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
ChangeLog in v2:
 - Set sysctl_tcp_early_retrans default to 2
ChangeLog in v3:
  - use separate u8 for early retrans stats in tcp_sock
  - disable ER if detects any reordering
ChangeLog in v4:
  - move disabling ER code on reordering into part2

 include/linux/tcp.h   |    3 +-
 include/net/tcp.h     |    3 ++
 net/ipv4/tcp_input.c  |   69 ++++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_output.c |    5 +--
 net/ipv4/tcp_timer.c  |    5 +++
 5 files changed, 74 insertions(+), 11 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 7859b41..d9b42c5be 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -372,7 +372,8 @@ struct tcp_sock {
 		repair      : 1,
 		unused      : 1;
 	u8	repair_queue;
-	u8	do_early_retrans:1;/* Enable RFC5827 early-retransmit  */
+	u8	do_early_retrans:1,/* Enable RFC5827 early-retransmit  */
+		early_retrans_delayed:1; /* Delayed ER timer installed */
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 685437a..5283aa4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -500,6 +500,8 @@ extern void tcp_send_delayed_ack(struct sock *sk);
 
 /* tcp_input.c */
 extern void tcp_cwnd_application_limited(struct sock *sk);
+extern void tcp_resume_early_retransmit(struct sock *sk);
+extern void tcp_rearm_rto(struct sock *sk);
 
 /* tcp_timer.c */
 extern void tcp_init_xmit_timers(struct sock *);
@@ -805,6 +807,7 @@ static inline void tcp_enable_early_retrans(struct tcp_sock *tp)
 {
 	tp->do_early_retrans = sysctl_tcp_early_retrans &&
 		!sysctl_tcp_thin_dupack && sysctl_tcp_reordering == 3;
+	tp->early_retrans_delayed = 0;
 }
 
 static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 82194c7..6c0bc6f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2344,6 +2344,27 @@ static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
 	return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
 }
 
+static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	unsigned long delay;
+
+	/* Delay early retransmit and entering fast recovery for
+	 * max(RTT/4, 2msec) unless ack has ECE mark, no RTT samples
+	 * available, or RTO is scheduled to fire first.
+	 */
+	if (sysctl_tcp_early_retrans < 2 || (flag & FLAG_ECE) || !tp->srtt)
+		return false;
+
+	delay = max_t(unsigned long, (tp->srtt >> 5), msecs_to_jiffies(2));
+	if (!time_after(inet_csk(sk)->icsk_timeout, (jiffies + delay)))
+		return false;
+
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, delay, TCP_RTO_MAX);
+	tp->early_retrans_delayed = 1;
+	return true;
+}
+
 static inline int tcp_skb_timedout(const struct sock *sk,
 				   const struct sk_buff *skb)
 {
@@ -2451,7 +2472,7 @@ static inline int tcp_head_timedout(const struct sock *sk)
  * Main question: may we further continue forward transmission
  * with the same cwnd?
  */
-static int tcp_time_to_recover(struct sock *sk)
+static int tcp_time_to_recover(struct sock *sk, int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 packets_out;
@@ -2505,7 +2526,7 @@ static int tcp_time_to_recover(struct sock *sk)
 	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
 	    (tp->packets_out == (tp->sacked_out + 1) && tp->packets_out < 4) &&
 	    !tcp_may_send_now(sk))
-		return 1;
+		return !tcp_pause_early_retransmit(sk, flag);
 
 	return 0;
 }
@@ -3172,7 +3193,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
 			tcp_try_undo_dsack(sk);
 
-		if (!tcp_time_to_recover(sk)) {
+		if (!tcp_time_to_recover(sk, flag)) {
 			tcp_try_to_open(sk, flag);
 			return;
 		}
@@ -3271,16 +3292,47 @@ static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 in_flight)
 /* Restart timer after forward progress on connection.
  * RFC2988 recommends to restart timer to now+rto.
  */
-static void tcp_rearm_rto(struct sock *sk)
+void tcp_rearm_rto(struct sock *sk)
 {
-	const struct tcp_sock *tp = tcp_sk(sk);
+	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (!tp->packets_out) {
 		inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
 	} else {
-		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+		u32 rto = inet_csk(sk)->icsk_rto;
+		/* Offset the time elapsed after installing regular RTO */
+		if (tp->early_retrans_delayed) {
+			struct sk_buff *skb = tcp_write_queue_head(sk);
+			const u32 rto_time_stamp = TCP_SKB_CB(skb)->when + rto;
+			s32 delta = (s32)(rto_time_stamp - tcp_time_stamp);
+			/* delta may not be positive if the socket is locked
+			 * when the delayed ER timer fires and is rescheduled.
+			 */
+			if (delta > 0)
+				rto = delta;
+		}
+		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
+					  TCP_RTO_MAX);
 	}
+	tp->early_retrans_delayed = 0;
+}
+
+/* This function is called when the delayed ER timer fires. TCP enters
+ * fast recovery and performs fast-retransmit.
+ */
+void tcp_resume_early_retransmit(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	tcp_rearm_rto(sk);
+
+	/* Stop if ER is disabled after the delayed ER timer is scheduled */
+	if (!tp->do_early_retrans)
+		return;
+
+	tcp_enter_recovery(sk, false);
+	tcp_update_scoreboard(sk, 1);
+	tcp_xmit_retransmit_queue(sk);
 }
 
 /* If we get here, the whole TSO packet has not been acked. */
@@ -3729,6 +3781,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, tp->snd_nxt))
 		goto invalid_ack;
 
+	if (tp->early_retrans_delayed)
+		tcp_rearm_rto(sk);
+
 	if (after(ack, prior_snd_una))
 		flag |= FLAG_SND_UNA_ADVANCED;
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 834e89f..d947330 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -78,9 +78,8 @@ static void tcp_event_new_data_sent(struct sock *sk, const struct sk_buff *skb)
 		tp->frto_counter = 3;
 
 	tp->packets_out += tcp_skb_pcount(skb);
-	if (!prior_packets)
-		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-					  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+	if (!prior_packets || tp->early_retrans_delayed)
+		tcp_rearm_rto(sk);
 }
 
 /* SND.NXT, if window was not shrunk.
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 34d4a02..e911e6c 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -319,6 +319,11 @@ void tcp_retransmit_timer(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
 
+	if (tp->early_retrans_delayed) {
+		tcp_resume_early_retransmit(sk);
+		return;
+	}
+
 	if (!tp->packets_out)
 		goto out;
 
-- 
1.7.7.3

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery()
  2012-05-02 23:30 [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
  2012-05-02 23:30 ` [PATCH v4 2/3] tcp: early retransmit Yuchung Cheng
  2012-05-02 23:30 ` [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
@ 2012-05-03  0:13 ` Neal Cardwell
  2012-05-03  1:56   ` Eric Dumazet
  2012-05-03  0:56 ` David Miller
  3 siblings, 1 reply; 10+ messages in thread
From: Neal Cardwell @ 2012-05-03  0:13 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev

On Wed, May 2, 2012 at 7:30 PM, Yuchung Cheng <ycheng@google.com> wrote:
> This a prepartion patch that refactors the code to enter recovery
> into a new function tcp_enter_recovery(). It's needed to implement
> the delayed fast retransmit in ER.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog since v1:
>  - swaped with part 1 and part2
> ChangeLog since v2:
>  - removed RFC in commit message
> ChangeLog since v3:
>  - nothing

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/3] tcp: early retransmit
  2012-05-02 23:30 ` [PATCH v4 2/3] tcp: early retransmit Yuchung Cheng
@ 2012-05-03  0:14   ` Neal Cardwell
  2012-05-03  0:56   ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: Neal Cardwell @ 2012-05-03  0:14 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev

On Wed, May 2, 2012 at 7:30 PM, Yuchung Cheng <ycheng@google.com> wrote:
> This patch implements RFC 5827 early retransmit (ER) for TCP.
> It reduces DUPACK threshold (dupthresh) if outstanding packets are
> less than 4 to recover losses by fast recovery instead of timeout.
>
> While the algorithm is simple, small but frequent network reordering
> makes this feature dangerous: the connection repeatedly enter
> false recovery and degrade performance. Therefore we implement
> a mitigation suggested in the appendix of the RFC that delays
> entering fast recovery by a small interval, i.e., RTT/4. Currently
> ER is conservative and is disabled for the rest of the connection
> after the first reordering event. A large scale web server
> experiment on the performance impact of ER is summarized in
> section 6 of the paper "Proportional Rate Reduction for TCP”,
> IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
>
> Note that Linux has a similar feature called THIN_DUPACK. The
> differences are THIN_DUPACK do not mitigate reorderings and is only
> used after slow start. Currently ER is disabled if THIN_DUPACK is
> enabled. I would be happy to merge THIN_DUPACK feature with ER if
> people think it's a good idea.
>
> ER is enabled by sysctl_tcp_early_retrans:
>  0: Disables ER
>
>  1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
>
>  2: (Default) reduce dupthresh like mode 1. In addition, delay
>     entering fast recovery by RTT/4.
>
> Note: mode 2 is implemented in the third part of this patch series.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog in v2:
>  - swapped part1 and part2
> ChangeLog in v3:
>  - trivial text fixes in documentation.
> ChangeLog in v4:
>  - disables ER if any reordering. shuffle early retransmit vars in tcp_sock.

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit
  2012-05-02 23:30 ` [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
@ 2012-05-03  0:15   ` Neal Cardwell
  2012-05-03  0:56   ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: Neal Cardwell @ 2012-05-03  0:15 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: davem, ilpo.jarvinen, nanditad, netdev

On Wed, May 2, 2012 at 7:30 PM, Yuchung Cheng <ycheng@google.com> wrote:
> Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
> Delays the fast retransmit by an interval of RTT/4. We borrow the
> RTO timer to implement the delay. If we receive another ACK or send
> a new packet, the timer is cancelled and restored to original RTO
> value offset by time elapsed.  When the delayed-ER timer fires,
> we enter fast recovery and perform fast retransmit.
>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
> ChangeLog in v2:
>  - Set sysctl_tcp_early_retrans default to 2
> ChangeLog in v3:
>  - use separate u8 for early retrans stats in tcp_sock
>  - disable ER if detects any reordering
> ChangeLog in v4:
>  - move disabling ER code on reordering into part2

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery()
  2012-05-02 23:30 [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
                   ` (2 preceding siblings ...)
  2012-05-03  0:13 ` [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Neal Cardwell
@ 2012-05-03  0:56 ` David Miller
  3 siblings, 0 replies; 10+ messages in thread
From: David Miller @ 2012-05-03  0:56 UTC (permalink / raw)
  To: ycheng; +Cc: ilpo.jarvinen, ncardwell, nanditad, netdev

From: Yuchung Cheng <ycheng@google.com>
Date: Wed,  2 May 2012 16:30:02 -0700

> This a prepartion patch that refactors the code to enter recovery
> into a new function tcp_enter_recovery(). It's needed to implement
> the delayed fast retransmit in ER.
> 
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Applied.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 2/3] tcp: early retransmit
  2012-05-02 23:30 ` [PATCH v4 2/3] tcp: early retransmit Yuchung Cheng
  2012-05-03  0:14   ` Neal Cardwell
@ 2012-05-03  0:56   ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: David Miller @ 2012-05-03  0:56 UTC (permalink / raw)
  To: ycheng; +Cc: ilpo.jarvinen, ncardwell, nanditad, netdev

From: Yuchung Cheng <ycheng@google.com>
Date: Wed,  2 May 2012 16:30:03 -0700

> This patch implements RFC 5827 early retransmit (ER) for TCP.
> It reduces DUPACK threshold (dupthresh) if outstanding packets are
> less than 4 to recover losses by fast recovery instead of timeout.
 ...
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Applied.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit
  2012-05-02 23:30 ` [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
  2012-05-03  0:15   ` Neal Cardwell
@ 2012-05-03  0:56   ` David Miller
  1 sibling, 0 replies; 10+ messages in thread
From: David Miller @ 2012-05-03  0:56 UTC (permalink / raw)
  To: ycheng; +Cc: ilpo.jarvinen, ncardwell, nanditad, netdev

From: Yuchung Cheng <ycheng@google.com>
Date: Wed,  2 May 2012 16:30:04 -0700

> Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
> Delays the fast retransmit by an interval of RTT/4. We borrow the
> RTO timer to implement the delay. If we receive another ACK or send
> a new packet, the timer is cancelled and restored to original RTO
> value offset by time elapsed.  When the delayed-ER timer fires,
> we enter fast recovery and perform fast retransmit.
> 
> Signed-off-by: Yuchung Cheng <ycheng@google.com>

Applied.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery()
  2012-05-03  0:13 ` [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Neal Cardwell
@ 2012-05-03  1:56   ` Eric Dumazet
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Dumazet @ 2012-05-03  1:56 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: Yuchung Cheng, davem, ilpo.jarvinen, nanditad, netdev

On Wed, 2012-05-02 at 20:13 -0400, Neal Cardwell wrote:
> On Wed, May 2, 2012 at 7:30 PM, Yuchung Cheng <ycheng@google.com> wrote:
> > This a prepartion patch that refactors the code to enter recovery
> > into a new function tcp_enter_recovery(). It's needed to implement
> > the delayed fast retransmit in ER.
> >
> > Signed-off-by: Yuchung Cheng <ycheng@google.com>
> > ---
> > ChangeLog since v1:
> >  - swaped with part 1 and part2
> > ChangeLog since v2:
> >  - removed RFC in commit message
> > ChangeLog since v3:
> >  - nothing
> 
> Acked-by: Neal Cardwell <ncardwell@google.com>

Yuchung, you should add our Acked-by to new submissions if no change on
a particular part was done, or else final inclusion wont have them

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-05-03  1:57 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-05-02 23:30 [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Yuchung Cheng
2012-05-02 23:30 ` [PATCH v4 2/3] tcp: early retransmit Yuchung Cheng
2012-05-03  0:14   ` Neal Cardwell
2012-05-03  0:56   ` David Miller
2012-05-02 23:30 ` [PATCH v4 3/3] tcp: early retransmit: delayed fast retransmit Yuchung Cheng
2012-05-03  0:15   ` Neal Cardwell
2012-05-03  0:56   ` David Miller
2012-05-03  0:13 ` [PATCH v4 1/3] tcp: early retransmit: tcp_enter_recovery() Neal Cardwell
2012-05-03  1:56   ` Eric Dumazet
2012-05-03  0:56 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).