* [PATCH] Proportional Rate Reduction for TCP.
@ 2011-08-12  7:29 Nandita Dukkipati
  2011-08-12 11:36 ` Ilpo Järvinen
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-12  7:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tom Herbert, Yuchung Cheng, Matt Mathis,
	Nandita Dukkipati
This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast
recovery. PRR avoids excessive window reductions and aims for
the actual congestion window size at the end of recovery to be as
close as possible to the window determined by the congestion control
algorithm. PRR also improves accuracy of the amount of data sent
during loss recovery.
The patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and
replaces the existing rate halving algorithm. PRR improves upon the
existing Linux fast recovery under a number of conditions including:
  1) burst losses where the losses implicitly reduce the amount of
outstanding data (pipe) below the ssthresh value selected by the
congestion control algorithm and,
  2) losses near the end of short flows where application runs out of
data to send.
As an example, with the existing rate halving implementation a single
loss event can cause a connection carrying short Web transactions to
go into the slow start mode after the recovery. This is because during
recovery Linux pulls the congestion window down to packets_in_flight+1
on every ACK. A short Web response often runs out of new data to send
and its pipe reduces to zero by the end of recovery when all its packets
are drained from the network. Subsequent HTTP responses using the same
connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh
by the end of recovery.
A description of PRR and a discussion of its performance can be found at
the following links:
- IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
- IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
- Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
 include/linux/tcp.h   |    4 +++
 net/ipv4/tcp_input.c  |   63 ++++++++++++++++++++++++++++++++++++++++++++----
 net/ipv4/tcp_output.c |    7 ++++-
 3 files changed, 67 insertions(+), 7 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 531ede8..dda4f2e1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -379,6 +379,10 @@ struct tcp_sock {
 	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
+	u32	prr_cwnd;	/* Congestion window at start of Recovery. */
+	u32	prr_delivered;	/* Number of newly delivered packets to
+				 * receiver in Recovery. */
+	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..601eff0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
 static inline void tcp_complete_cwr(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	/* Do not moderate cwnd if it's already undone in cwr or recovery */
-	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
-		tp->snd_cwnd = tp->snd_ssthresh;
+
+	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
+	if (tp->undo_marker) {
+
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
+			tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
+		else /* PRR */
+			tp->snd_cwnd = tp->snd_ssthresh;
 		tp->snd_cwnd_stamp = tcp_time_stamp;
 	}
 	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
@@ -2950,6 +2955,39 @@ void tcp_simple_retransmit(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_simple_retransmit);
 
+/* This function implements the PRR algorithm, specifcally the PRR-SSRB
+ * (proportional rate reduction with slow start reduction bound) as described in
+ * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
+ * It computes the number of packets to send (sndcnt) based on packets newly
+ * delivered:
+ *   1) If the packets in flight is larger than ssthresh, PRR spreads the
+ *	cwnd reductions across a full RTT.
+ *   2) If packets in flight is lower than ssthresh (such as due to excess
+ *	losses and/or application stalls), do not perform any further cwnd
+ *	reductions, but instead slow start up to ssthresh.
+ */
+static void tcp_update_cwnd_in_recovery(struct sock *sk, int pkts_delivered,
+					int fast_rexmit, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int sndcnt = 0;
+	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
+
+	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
+		if (WARN_ON(!tp->prr_cwnd))
+			tp->prr_cwnd = 1;
+		sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
+				      tp->prr_cwnd) - tp->prr_out;
+	} else {
+		sndcnt = min_t(int, delta,
+			       max_t(int, tp->prr_delivered - tp->prr_out,
+				     pkts_delivered) + 1);
+	}
+
+	sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
+	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -2961,7 +2999,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
  * It does _not_ decide what to send, it is made in function
  * tcp_xmit_retransmit_queue().
  */
-static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
+static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
+				  int pkts_delivered, int flag)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3111,13 +3150,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 
 		tp->bytes_acked = 0;
 		tp->snd_cwnd_cnt = 0;
+		tp->prr_cwnd = tp->snd_cwnd;
+		tp->prr_delivered = 0;
+		tp->prr_out = 0;
 		tcp_set_ca_state(sk, TCP_CA_Recovery);
 		fast_rexmit = 1;
 	}
 
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
-	tcp_cwnd_down(sk, flag);
+	tp->prr_delivered += pkts_delivered;
+	tcp_update_cwnd_in_recovery(sk, pkts_delivered, fast_rexmit, flag);
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -3632,6 +3675,11 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	u32 prior_in_flight;
 	u32 prior_fackets;
 	int prior_packets;
+	int prior_sacked = tp->sacked_out;
+	/* pkts_delivered is number of packets newly cumulatively acked or
+	 * sacked on this ACK.
+	 */
+	int pkts_delivered = 0;
 	int frto_cwnd = 0;
 
 	/* If the ack is older than previous acks
@@ -3703,6 +3751,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
 
+	pkts_delivered = (prior_packets - prior_sacked) -
+			 (tp->packets_out - tp->sacked_out);
+
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 	/* Guarantee sacktag reordering detection against wrap-arounds */
@@ -3715,7 +3766,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 		    tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
-				      flag);
+				      pkts_delivered, flag);
 	} else {
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..ca50408 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		tcp_event_new_data_sent(sk, skb);
 
 		tcp_minshall_update(tp, mss_now, skb);
-		sent_pkts++;
+		sent_pkts += tcp_skb_pcount(skb);
 
 		if (push_one)
 			break;
 	}
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+		tp->prr_out += sent_pkts;
 
 	if (likely(sent_pkts)) {
 		tcp_cwnd_validate(sk);
@@ -2294,6 +2296,9 @@ begin_fwd:
 			return;
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
 
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+			tp->prr_out += tcp_skb_pcount(skb);
+
 		if (skb == tcp_write_queue_head(sk))
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 						  inet_csk(sk)->icsk_rto,
-- 
1.7.3.1
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH] Proportional Rate Reduction for TCP.
  2011-08-12  7:29 [PATCH] Proportional Rate Reduction for TCP Nandita Dukkipati
@ 2011-08-12 11:36 ` Ilpo Järvinen
  2011-08-19  7:34   ` Nandita Dukkipati
  2011-08-14  5:05 ` Andi Kleen
  2011-08-19  7:33 ` [PATCH v2] " Nandita Dukkipati
  2 siblings, 1 reply; 15+ messages in thread
From: Ilpo Järvinen @ 2011-08-12 11:36 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, Netdev, Tom Herbert, Yuchung Cheng, Matt Mathis
> This patch implements Proportional Rate Reduction (PRR) for TCP.
> PRR is an algorithm that determines TCP's sending rate in fast
> recovery. PRR avoids excessive window reductions and aims for
> the actual congestion window size at the end of recovery to be as
> close as possible to the window determined by the congestion control
> algorithm. PRR also improves accuracy of the amount of data sent
> during loss recovery.
> 
> The patch implements the recommended flavor of PRR called PRR-SSRB
> (Proportional rate reduction with slow start reduction bound) and
> replaces the existing rate halving algorithm. PRR improves upon the
> existing Linux fast recovery under a number of conditions including:
>   1) burst losses where the losses implicitly reduce the amount of
> outstanding data (pipe) below the ssthresh value selected by the
> congestion control algorithm and,
>   2) losses near the end of short flows where application runs out of
> data to send.
> 
> As an example, with the existing rate halving implementation a single
> loss event can cause a connection carrying short Web transactions to
> go into the slow start mode after the recovery. This is because during
> recovery Linux pulls the congestion window down to packets_in_flight+1
> on every ACK. A short Web response often runs out of new data to send
> and its pipe reduces to zero by the end of recovery when all its packets
> are drained from the network. Subsequent HTTP responses using the same
> connection will have to slow start to raise cwnd to ssthresh. PRR on
> the other hand aims for the cwnd to be as close as possible to ssthresh
> by the end of recovery.
> 
> A description of PRR and a discussion of its performance can be found at
> the following links:
> - IETF Draft:
>     http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
> - IETF Slides:
>     http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
>     http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
> - Paper to appear in Internet Measurements Conference (IMC) 2011:
>     Improving TCP Loss Recovery
>     Nandita Dukkipati, Matt Mathis, Yuchung Cheng
> 
> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> ---
>  include/linux/tcp.h   |    4 +++
>  net/ipv4/tcp_input.c  |   63 ++++++++++++++++++++++++++++++++++++++++++++----
>  net/ipv4/tcp_output.c |    7 ++++-
>  3 files changed, 67 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 531ede8..dda4f2e1 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -379,6 +379,10 @@ struct tcp_sock {
>  	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
>  	u32	snd_cwnd_used;
>  	u32	snd_cwnd_stamp;
> +	u32	prr_cwnd;	/* Congestion window at start of Recovery. */
> +	u32	prr_delivered;	/* Number of newly delivered packets to
> +				 * receiver in Recovery. */
> +	u32	prr_out;	/* Total number of pkts sent during Recovery. */
>  
>   	u32	rcv_wnd;	/* Current receiver window		*/
>  	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ea0d218..601eff0 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
>  static inline void tcp_complete_cwr(struct sock *sk)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	/* Do not moderate cwnd if it's already undone in cwr or recovery */
> -	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
> -		tp->snd_cwnd = tp->snd_ssthresh;
> +
> +	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
> +	if (tp->undo_marker) {
> +
> +		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
> +			tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
> +		else /* PRR */
> +			tp->snd_cwnd = tp->snd_ssthresh;
>  		tp->snd_cwnd_stamp = tcp_time_stamp;
>  	}
>  	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
> @@ -2950,6 +2955,39 @@ void tcp_simple_retransmit(struct sock *sk)
>  }
>  EXPORT_SYMBOL(tcp_simple_retransmit);
>  
> +/* This function implements the PRR algorithm, specifcally the PRR-SSRB
> + * (proportional rate reduction with slow start reduction bound) as described in
> + * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
> + * It computes the number of packets to send (sndcnt) based on packets newly
> + * delivered:
> + *   1) If the packets in flight is larger than ssthresh, PRR spreads the
> + *	cwnd reductions across a full RTT.
> + *   2) If packets in flight is lower than ssthresh (such as due to excess
> + *	losses and/or application stalls), do not perform any further cwnd
> + *	reductions, but instead slow start up to ssthresh.
> + */
> +static void tcp_update_cwnd_in_recovery(struct sock *sk, int pkts_delivered,
> +					int fast_rexmit, int flag)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	int sndcnt = 0;
> +	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> +
> +	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> +		if (WARN_ON(!tp->prr_cwnd))
> +			tp->prr_cwnd = 1;
Thanks, this made me to smile when I realized what kind of positive feedback
loop it would cause in a heavily congested environment :-). ...Perhaps any
value below 2 * tp->snd_ssthresh isn't that safe safety relief valve here.
...Of course this condition isn't ever hit with the current code base so 
no real harm being done regardless of the value.
> +		sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
> +				      tp->prr_cwnd) - tp->prr_out;
What about very large windows ...overflow?
Due to not having lower bound here, the resulting window can end up below
snd_ssthresh in one step if prr_delivered suddently jumps to a value that
is above prr_cnwd?
Also, snd_ssthresh and prr_cwnd are essentially constants over the
whole recovery and yet divide is needed whole the time. ...And in practically
all cases the proportion will be 0.5 (except for the odd number produced
off-by-one thing)?
> +	} else {
> +		sndcnt = min_t(int, delta,
> +			       max_t(int, tp->prr_delivered - tp->prr_out,
> +				     pkts_delivered) + 1);
> +	}
> +
> +	sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
> +	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
> +}
> +
>  /* Process an event, which can update packets-in-flight not trivially.
>   * Main goal of this function is to calculate new estimate for left_out,
>   * taking into account both packets sitting in receiver's buffer and
> @@ -2961,7 +2999,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
>   * It does _not_ decide what to send, it is made in function
>   * tcp_xmit_retransmit_queue().
>   */
> -static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
> +static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> +				  int pkts_delivered, int flag)
>  {
>  	struct inet_connection_sock *icsk = inet_csk(sk);
>  	struct tcp_sock *tp = tcp_sk(sk);
> @@ -3111,13 +3150,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
>  
>  		tp->bytes_acked = 0;
>  		tp->snd_cwnd_cnt = 0;
> +		tp->prr_cwnd = tp->snd_cwnd;
prior_cwnd would make this variable about 1000 times easier to read
without requiring some background.
While at it, I realized that there is some restore window place in the 
generic code that assumes "prior_cwnd == 2 * tp->snd_ssthresh" which is 
not true with every single cc module.
> +		tp->prr_delivered = 0;
> +		tp->prr_out = 0;
>  		tcp_set_ca_state(sk, TCP_CA_Recovery);
>  		fast_rexmit = 1;
>  	}
>  
>  	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
>  		tcp_update_scoreboard(sk, fast_rexmit);
> -	tcp_cwnd_down(sk, flag);
> +	tp->prr_delivered += pkts_delivered;
...Is here a bug in the calculation if the recovery was triggered with
_a cumulative ACK_ that had enough SACK blocks?
> +	tcp_update_cwnd_in_recovery(sk, pkts_delivered, fast_rexmit, flag);
>  	tcp_xmit_retransmit_queue(sk);
>  }
>  
> @@ -3632,6 +3675,11 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
>  	u32 prior_in_flight;
>  	u32 prior_fackets;
>  	int prior_packets;
> +	int prior_sacked = tp->sacked_out;
> +	/* pkts_delivered is number of packets newly cumulatively acked or
> +	 * sacked on this ACK.
> +	 */
> +	int pkts_delivered = 0;
If you need a comment like that on variable declaration, maybe its name
could be more obvious, pkts_acked_sacked ?
>  	int frto_cwnd = 0;
>  
>  	/* If the ack is older than previous acks
> @@ -3703,6 +3751,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
>  	/* See if we can take anything off of the retransmit queue. */
>  	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
> +	pkts_delivered = (prior_packets - prior_sacked) -
> +			 (tp->packets_out - tp->sacked_out);
> +
>  	if (tp->frto_counter)
>  		frto_cwnd = tcp_process_frto(sk, flag);
>  	/* Guarantee sacktag reordering detection against wrap-arounds */
> @@ -3715,7 +3766,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
>  		    tcp_may_raise_cwnd(sk, flag))
>  			tcp_cong_avoid(sk, ack, prior_in_flight);
>  		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
> -				      flag);
> +				      pkts_delivered, flag);
>  	} else {
>  		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
>  			tcp_cong_avoid(sk, ack, prior_in_flight);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 882e0b0..ca50408 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  		tcp_event_new_data_sent(sk, skb);
>  
>  		tcp_minshall_update(tp, mss_now, skb);
> -		sent_pkts++;
> +		sent_pkts += tcp_skb_pcount(skb);
>  
>  		if (push_one)
>  			break;
>  	}
> +	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> +		tp->prr_out += sent_pkts;
Is there some harm done if you get rid of the conditionality?
>  	if (likely(sent_pkts)) {
>  		tcp_cwnd_validate(sk);
> @@ -2294,6 +2296,9 @@ begin_fwd:
>  			return;
>  		NET_INC_STATS_BH(sock_net(sk), mib_idx);
>  
> +		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> +			tp->prr_out += tcp_skb_pcount(skb);
> +
...same here?
>  		if (skb == tcp_write_queue_head(sk))
>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>  						  inet_csk(sk)->icsk_rto,
Besides the points above, I'm not convinced that we really need all the
across-packet state that is present, some relevant values should be
available before we do any ACK processing (in-flight, cwnd, and more
importantly, the difference between them). ...But I haven't yet
convinced myself of anything particular.
-- 
 i.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH] Proportional Rate Reduction for TCP.
  2011-08-12  7:29 [PATCH] Proportional Rate Reduction for TCP Nandita Dukkipati
  2011-08-12 11:36 ` Ilpo Järvinen
@ 2011-08-14  5:05 ` Andi Kleen
  2011-08-19  7:34   ` Nandita Dukkipati
  2011-08-19  7:33 ` [PATCH v2] " Nandita Dukkipati
  2 siblings, 1 reply; 15+ messages in thread
From: Andi Kleen @ 2011-08-14  5:05 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, netdev, Tom Herbert, Yuchung Cheng, Matt Mathis
Nandita Dukkipati <nanditad@google.com> writes:
> +
> +	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> +		if (WARN_ON(!tp->prr_cwnd))
> +			tp->prr_cwnd = 1;
> +		sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
> +				      tp->prr_cwnd) - tp->prr_out;
> +	} else {
> +		sndcnt = min_t(int, delta,
> +			       max_t(int, tp->prr_delivered -
> tp->prr_out,
u32s here? This will likely do bad things with large enough windows.
The rest looks good to me as code, but I don't claim to fully understand the 
algorithm. Perhaps a sysctl to turn it off and a linux mib counter when 
it triggers would be useful in addition.
-Andi
-- 
ak@linux.intel.com -- Speaking for myself only
^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH v2] Proportional Rate Reduction for TCP.
  2011-08-12  7:29 [PATCH] Proportional Rate Reduction for TCP Nandita Dukkipati
  2011-08-12 11:36 ` Ilpo Järvinen
  2011-08-14  5:05 ` Andi Kleen
@ 2011-08-19  7:33 ` Nandita Dukkipati
  2011-08-19 10:25   ` Ilpo Järvinen
                     ` (2 more replies)
  2 siblings, 3 replies; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-19  7:33 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tom Herbert, Matt Mathis, Yuchung Cheng,
	Nandita Dukkipati
This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast
recovery. PRR avoids excessive window reductions and aims for
the actual congestion window size at the end of recovery to be as
close as possible to the window determined by the congestion control
algorithm. PRR also improves accuracy of the amount of data sent
during loss recovery.
The patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and
replaces the existing rate halving algorithm. PRR improves upon the
existing Linux fast recovery under a number of conditions including:
  1) burst losses where the losses implicitly reduce the amount of
outstanding data (pipe) below the ssthresh value selected by the
congestion control algorithm and,
  2) losses near the end of short flows where application runs out of
data to send.
As an example, with the existing rate halving implementation a single
loss event can cause a connection carrying short Web transactions to
go into the slow start mode after the recovery. This is because during
recovery Linux pulls the congestion window down to packets_in_flight+1
on every ACK. A short Web response often runs out of new data to send
and its pipe reduces to zero by the end of recovery when all its packets
are drained from the network. Subsequent HTTP responses using the same
connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh
by the end of recovery.
A description of PRR and a discussion of its performance can be found at
the following links:
- IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
- IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
- Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
Changelog since v1:
- Took care of overflow for large congestion windows in tcp_update_cwnd_in_recovery().
- Renamed prr_cwnd to prior_cwnd.
- Renamed pkts_delivered to newly_acked_sacked.
 include/linux/tcp.h   |    4 +++
 net/ipv4/tcp_input.c  |   61 ++++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_output.c |    7 +++++-
 3 files changed, 65 insertions(+), 7 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 531ede8..6b63b31 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -379,6 +379,10 @@ struct tcp_sock {
 	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
+	u32	prior_cwnd;	/* Congestion window at start of Recovery. */
+	u32	prr_delivered;	/* Number of newly delivered packets to
+				 * receiver in Recovery. */
+	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..41e5f43 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
 static inline void tcp_complete_cwr(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	/* Do not moderate cwnd if it's already undone in cwr or recovery */
-	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
-		tp->snd_cwnd = tp->snd_ssthresh;
+
+	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
+	if (tp->undo_marker) {
+
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
+			tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
+		else /* PRR */
+			tp->snd_cwnd = tp->snd_ssthresh;
 		tp->snd_cwnd_stamp = tcp_time_stamp;
 	}
 	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
@@ -2950,6 +2955,40 @@ void tcp_simple_retransmit(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_simple_retransmit);
 
+/* This function implements the PRR algorithm, specifcally the PRR-SSRB
+ * (proportional rate reduction with slow start reduction bound) as described in
+ * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
+ * It computes the number of packets to send (sndcnt) based on packets newly
+ * delivered:
+ *   1) If the packets in flight is larger than ssthresh, PRR spreads the
+ *	cwnd reductions across a full RTT.
+ *   2) If packets in flight is lower than ssthresh (such as due to excess
+ *	losses and/or application stalls), do not perform any further cwnd
+ *	reductions, but instead slow start up to ssthresh.
+ */
+static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
+					int fast_rexmit, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int sndcnt = 0;
+	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
+
+	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
+		if (WARN_ON(!tp->prior_cwnd))
+			tp->prior_cwnd = 1;
+		sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
+					    tp->snd_ssthresh),
+				      (u64)tp->prior_cwnd) - tp->prr_out;
+	} else {
+		sndcnt = min_t(int, delta,
+			       max_t(int, tp->prr_delivered - tp->prr_out,
+				     newly_acked_sacked) + 1);
+	}
+
+	sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
+	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -2961,7 +3000,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
  * It does _not_ decide what to send, it is made in function
  * tcp_xmit_retransmit_queue().
  */
-static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
+static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
+				  int newly_acked_sacked, int flag)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3111,13 +3151,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 
 		tp->bytes_acked = 0;
 		tp->snd_cwnd_cnt = 0;
+		tp->prior_cwnd = tp->snd_cwnd;
+		tp->prr_delivered = 0;
+		tp->prr_out = 0;
 		tcp_set_ca_state(sk, TCP_CA_Recovery);
 		fast_rexmit = 1;
 	}
 
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
-	tcp_cwnd_down(sk, flag);
+	tp->prr_delivered += newly_acked_sacked;
+	tcp_update_cwnd_in_recovery(sk, newly_acked_sacked, fast_rexmit, flag);
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -3632,6 +3676,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	u32 prior_in_flight;
 	u32 prior_fackets;
 	int prior_packets;
+	int prior_sacked = tp->sacked_out;
+	int newly_acked_sacked = 0;
 	int frto_cwnd = 0;
 
 	/* If the ack is older than previous acks
@@ -3703,6 +3749,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
 
+	newly_acked_sacked = (prior_packets - prior_sacked) -
+			     (tp->packets_out - tp->sacked_out);
+
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 	/* Guarantee sacktag reordering detection against wrap-arounds */
@@ -3715,7 +3764,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 		    tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
-				      flag);
+				      newly_acked_sacked, flag);
 	} else {
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..ca50408 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		tcp_event_new_data_sent(sk, skb);
 
 		tcp_minshall_update(tp, mss_now, skb);
-		sent_pkts++;
+		sent_pkts += tcp_skb_pcount(skb);
 
 		if (push_one)
 			break;
 	}
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+		tp->prr_out += sent_pkts;
 
 	if (likely(sent_pkts)) {
 		tcp_cwnd_validate(sk);
@@ -2294,6 +2296,9 @@ begin_fwd:
 			return;
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
 
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+			tp->prr_out += tcp_skb_pcount(skb);
+
 		if (skb == tcp_write_queue_head(sk))
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 						  inet_csk(sk)->icsk_rto,
-- 
1.7.3.1
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH] Proportional Rate Reduction for TCP.
  2011-08-12 11:36 ` Ilpo Järvinen
@ 2011-08-19  7:34   ` Nandita Dukkipati
  0 siblings, 0 replies; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-19  7:34 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David S. Miller, Netdev, Tom Herbert, Yuchung Cheng, Matt Mathis
On Fri, Aug 12, 2011 at 4:36 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> > This patch implements Proportional Rate Reduction (PRR) for TCP.
> > PRR is an algorithm that determines TCP's sending rate in fast
> > recovery. PRR avoids excessive window reductions and aims for
> > the actual congestion window size at the end of recovery to be as
> > close as possible to the window determined by the congestion control
> > algorithm. PRR also improves accuracy of the amount of data sent
> > during loss recovery.
> >
> > The patch implements the recommended flavor of PRR called PRR-SSRB
> > (Proportional rate reduction with slow start reduction bound) and
> > replaces the existing rate halving algorithm. PRR improves upon the
> > existing Linux fast recovery under a number of conditions including:
> >   1) burst losses where the losses implicitly reduce the amount of
> > outstanding data (pipe) below the ssthresh value selected by the
> > congestion control algorithm and,
> >   2) losses near the end of short flows where application runs out of
> > data to send.
> >
> > As an example, with the existing rate halving implementation a single
> > loss event can cause a connection carrying short Web transactions to
> > go into the slow start mode after the recovery. This is because during
> > recovery Linux pulls the congestion window down to packets_in_flight+1
> > on every ACK. A short Web response often runs out of new data to send
> > and its pipe reduces to zero by the end of recovery when all its packets
> > are drained from the network. Subsequent HTTP responses using the same
> > connection will have to slow start to raise cwnd to ssthresh. PRR on
> > the other hand aims for the cwnd to be as close as possible to ssthresh
> > by the end of recovery.
> >
> > A description of PRR and a discussion of its performance can be found at
> > the following links:
> > - IETF Draft:
> >     http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
> > - IETF Slides:
> >     http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
> >     http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
> > - Paper to appear in Internet Measurements Conference (IMC) 2011:
> >     Improving TCP Loss Recovery
> >     Nandita Dukkipati, Matt Mathis, Yuchung Cheng
> >
> > Signed-off-by: Nandita Dukkipati <nanditad@google.com>
> > ---
> >  include/linux/tcp.h   |    4 +++
> >  net/ipv4/tcp_input.c  |   63 ++++++++++++++++++++++++++++++++++++++++++++----
> >  net/ipv4/tcp_output.c |    7 ++++-
> >  3 files changed, 67 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index 531ede8..dda4f2e1 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -379,6 +379,10 @@ struct tcp_sock {
> >       u32     snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
> >       u32     snd_cwnd_used;
> >       u32     snd_cwnd_stamp;
> > +     u32     prr_cwnd;       /* Congestion window at start of Recovery. */
> > +     u32     prr_delivered;  /* Number of newly delivered packets to
> > +                              * receiver in Recovery. */
> > +     u32     prr_out;        /* Total number of pkts sent during Recovery. */
> >
> >       u32     rcv_wnd;        /* Current receiver window              */
> >       u32     write_seq;      /* Tail(+1) of data held in tcp send buffer */
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index ea0d218..601eff0 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
> >  static inline void tcp_complete_cwr(struct sock *sk)
> >  {
> >       struct tcp_sock *tp = tcp_sk(sk);
> > -     /* Do not moderate cwnd if it's already undone in cwr or recovery */
> > -     if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
> > -             tp->snd_cwnd = tp->snd_ssthresh;
> > +
> > +     /* Do not moderate cwnd if it's already undone in cwr or recovery. */
> > +     if (tp->undo_marker) {
> > +
> > +             if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
> > +                     tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
> > +             else /* PRR */
> > +                     tp->snd_cwnd = tp->snd_ssthresh;
> >               tp->snd_cwnd_stamp = tcp_time_stamp;
> >       }
> >       tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
> > @@ -2950,6 +2955,39 @@ void tcp_simple_retransmit(struct sock *sk)
> >  }
> >  EXPORT_SYMBOL(tcp_simple_retransmit);
> >
> > +/* This function implements the PRR algorithm, specifcally the PRR-SSRB
> > + * (proportional rate reduction with slow start reduction bound) as described in
> > + * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
> > + * It computes the number of packets to send (sndcnt) based on packets newly
> > + * delivered:
> > + *   1) If the packets in flight is larger than ssthresh, PRR spreads the
> > + *   cwnd reductions across a full RTT.
> > + *   2) If packets in flight is lower than ssthresh (such as due to excess
> > + *   losses and/or application stalls), do not perform any further cwnd
> > + *   reductions, but instead slow start up to ssthresh.
> > + */
> > +static void tcp_update_cwnd_in_recovery(struct sock *sk, int pkts_delivered,
> > +                                     int fast_rexmit, int flag)
> > +{
> > +     struct tcp_sock *tp = tcp_sk(sk);
> > +     int sndcnt = 0;
> > +     int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> > +
> > +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> > +             if (WARN_ON(!tp->prr_cwnd))
> > +                     tp->prr_cwnd = 1;
>
> Thanks, this made me to smile when I realized what kind of positive feedback
> loop it would cause in a heavily congested environment :-). ...Perhaps any
> value below 2 * tp->snd_ssthresh isn't that safe safety relief valve here.
> ...Of course this condition isn't ever hit with the current code base so
> no real harm being done regardless of the value.
>
> > +             sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
> > +                                   tp->prr_cwnd) - tp->prr_out;
>
> What about very large windows ...overflow?
Addressed in patch v2.
> Due to not having lower bound here, the resulting window can end up below
> snd_ssthresh in one step if prr_delivered suddently jumps to a value that
> is above prr_cnwd?
>
> Also, snd_ssthresh and prr_cwnd are essentially constants over the
> whole recovery and yet divide is needed whole the time. ...And in practically
> all cases the proportion will be 0.5 (except for the odd number produced
> off-by-one thing)?
>
> > +     } else {
> > +             sndcnt = min_t(int, delta,
> > +                            max_t(int, tp->prr_delivered - tp->prr_out,
> > +                                  pkts_delivered) + 1);
> > +     }
> > +
> > +     sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
> > +     tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
> > +}
> > +
> >  /* Process an event, which can update packets-in-flight not trivially.
> >   * Main goal of this function is to calculate new estimate for left_out,
> >   * taking into account both packets sitting in receiver's buffer and
> > @@ -2961,7 +2999,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
> >   * It does _not_ decide what to send, it is made in function
> >   * tcp_xmit_retransmit_queue().
> >   */
> > -static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
> > +static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> > +                               int pkts_delivered, int flag)
> >  {
> >       struct inet_connection_sock *icsk = inet_csk(sk);
> >       struct tcp_sock *tp = tcp_sk(sk);
> > @@ -3111,13 +3150,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
> >
> >               tp->bytes_acked = 0;
> >               tp->snd_cwnd_cnt = 0;
> > +             tp->prr_cwnd = tp->snd_cwnd;
>
> prior_cwnd would make this variable about 1000 times easier to read
> without requiring some background.
Patch v2 changes this to prior_cwnd.
>
> While at it, I realized that there is some restore window place in the
> generic code that assumes "prior_cwnd == 2 * tp->snd_ssthresh" which is
> not true with every single cc module.
>
> > +             tp->prr_delivered = 0;
> > +             tp->prr_out = 0;
> >               tcp_set_ca_state(sk, TCP_CA_Recovery);
> >               fast_rexmit = 1;
> >       }
> >
> >       if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
> >               tcp_update_scoreboard(sk, fast_rexmit);
> > -     tcp_cwnd_down(sk, flag);
> > +     tp->prr_delivered += pkts_delivered;
>
> ...Is here a bug in the calculation if the recovery was triggered with
> _a cumulative ACK_ that had enough SACK blocks?
>
> > +     tcp_update_cwnd_in_recovery(sk, pkts_delivered, fast_rexmit, flag);
> >       tcp_xmit_retransmit_queue(sk);
> >  }
> >
> > @@ -3632,6 +3675,11 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >       u32 prior_in_flight;
> >       u32 prior_fackets;
> >       int prior_packets;
> > +     int prior_sacked = tp->sacked_out;
> > +     /* pkts_delivered is number of packets newly cumulatively acked or
> > +      * sacked on this ACK.
> > +      */
> > +     int pkts_delivered = 0;
>
> If you need a comment like that on variable declaration, maybe its name
> could be more obvious, pkts_acked_sacked ?
Changed this to newly_acked_sacked as per your suggestion.
>
> >       int frto_cwnd = 0;
> >
> >       /* If the ack is older than previous acks
> > @@ -3703,6 +3751,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >       /* See if we can take anything off of the retransmit queue. */
> >       flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
> > +     pkts_delivered = (prior_packets - prior_sacked) -
> > +                      (tp->packets_out - tp->sacked_out);
> > +
> >       if (tp->frto_counter)
> >               frto_cwnd = tcp_process_frto(sk, flag);
> >       /* Guarantee sacktag reordering detection against wrap-arounds */
> > @@ -3715,7 +3766,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
> >                   tcp_may_raise_cwnd(sk, flag))
> >                       tcp_cong_avoid(sk, ack, prior_in_flight);
> >               tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
> > -                                   flag);
> > +                                   pkts_delivered, flag);
> >       } else {
> >               if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
> >                       tcp_cong_avoid(sk, ack, prior_in_flight);
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 882e0b0..ca50408 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> >               tcp_event_new_data_sent(sk, skb);
> >
> >               tcp_minshall_update(tp, mss_now, skb);
> > -             sent_pkts++;
> > +             sent_pkts += tcp_skb_pcount(skb);
> >
> >               if (push_one)
> >                       break;
> >       }
> > +     if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> > +             tp->prr_out += sent_pkts;
>
> Is there some harm done if you get rid of the conditionality?
>
> >       if (likely(sent_pkts)) {
> >               tcp_cwnd_validate(sk);
> > @@ -2294,6 +2296,9 @@ begin_fwd:
> >                       return;
> >               NET_INC_STATS_BH(sock_net(sk), mib_idx);
> >
> > +             if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
> > +                     tp->prr_out += tcp_skb_pcount(skb);
> > +
>
> ...same here?
No harm done if the condition is removed in either of these case.
However, letting the condition remain for now for ease of readability.
>
> >               if (skb == tcp_write_queue_head(sk))
> >                       inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> >                                                 inet_csk(sk)->icsk_rto,
>
> Besides the points above, I'm not convinced that we really need all the
> across-packet state that is present, some relevant values should be
> available before we do any ACK processing (in-flight, cwnd, and more
> importantly, the difference between them). ...But I haven't yet
> convinced myself of anything particular.
>
> --
>  i.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH] Proportional Rate Reduction for TCP.
  2011-08-14  5:05 ` Andi Kleen
@ 2011-08-19  7:34   ` Nandita Dukkipati
  0 siblings, 0 replies; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-19  7:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David S. Miller, netdev, Tom Herbert, Yuchung Cheng, Matt Mathis
On Sat, Aug 13, 2011 at 10:05 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Nandita Dukkipati <nanditad@google.com> writes:
>> +
>> +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
>> +             if (WARN_ON(!tp->prr_cwnd))
>> +                     tp->prr_cwnd = 1;
>> +             sndcnt = DIV_ROUND_UP(tp->prr_delivered * tp->snd_ssthresh,
>> +                                   tp->prr_cwnd) - tp->prr_out;
>> +     } else {
>> +             sndcnt = min_t(int, delta,
>> +                            max_t(int, tp->prr_delivered -
>> tp->prr_out,
>
> u32s here? This will likely do bad things with large enough windows.
Fixed in patch v2.
>
> The rest looks good to me as code, but I don't claim to fully understand the
> algorithm. Perhaps a sysctl to turn it off and a linux mib counter when
> it triggers would be useful in addition.
There already exists a mib counter when entering Recovery state, so
this will be incremented when PRR is triggered.
Thanks
Nandita
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2] Proportional Rate Reduction for TCP.
  2011-08-19  7:33 ` [PATCH v2] " Nandita Dukkipati
@ 2011-08-19 10:25   ` Ilpo Järvinen
  2011-08-20  1:28     ` Nandita Dukkipati
  2011-08-19 10:26   ` David Miller
  2011-08-20  1:29   ` [PATCH v3] " Nandita Dukkipati
  2 siblings, 1 reply; 15+ messages in thread
From: Ilpo Järvinen @ 2011-08-19 10:25 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, netdev, Tom Herbert, Matt Mathis, Yuchung Cheng
On Fri, 19 Aug 2011, Nandita Dukkipati wrote:
> +static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
> +					int fast_rexmit, int flag)
> +{
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	int sndcnt = 0;
> +	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> +
> +	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> +		if (WARN_ON(!tp->prior_cwnd))
> +			tp->prior_cwnd = 1;
This should still be made larger to avoid problems if it ever will be 
needed.
> +		sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
> +					    tp->snd_ssthresh),
> +				      (u64)tp->prior_cwnd) - tp->prr_out;
I think you should pick one from include/linux/math64.h instead of letting 
gcc to do / operand all by itself. ...Obviosly then the ROUND_UP part 
needs to be done manually (either using the remainder or pre-addition 
like DIV_ROUND_UP does).
-- 
 i.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2] Proportional Rate Reduction for TCP.
  2011-08-19  7:33 ` [PATCH v2] " Nandita Dukkipati
  2011-08-19 10:25   ` Ilpo Järvinen
@ 2011-08-19 10:26   ` David Miller
  2011-08-20  1:29     ` Nandita Dukkipati
  2011-08-20  1:29   ` [PATCH v3] " Nandita Dukkipati
  2 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2011-08-19 10:26 UTC (permalink / raw)
  To: nanditad; +Cc: netdev, therbert, mattmathis, ycheng
From: Nandita Dukkipati <nanditad@google.com>
Date: Fri, 19 Aug 2011 00:33:32 -0700
> @@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
>  static inline void tcp_complete_cwr(struct sock *sk)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	/* Do not moderate cwnd if it's already undone in cwr or recovery */
> -	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
> -		tp->snd_cwnd = tp->snd_ssthresh;
> +
> +	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
> +	if (tp->undo_marker) {
> +
> +		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
Please get rid of that empty line before the TCP_CA_CWR case.
> +		sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
> +					    tp->snd_ssthresh),
> +				      (u64)tp->prior_cwnd) - tp->prr_out;
This won't link on 32-bit unless __divdi3 libgcc routine is provided
by the architecture.  To portably do 64-bit division you need to use
do_div() or something based upon it.  Perhaps DIV_ROUND_UP_LL() will
work best in this case.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2] Proportional Rate Reduction for TCP.
  2011-08-19 10:25   ` Ilpo Järvinen
@ 2011-08-20  1:28     ` Nandita Dukkipati
  2011-08-20 12:41       ` Ilpo Järvinen
  0 siblings, 1 reply; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-20  1:28 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David S. Miller, netdev, Tom Herbert, Matt Mathis, Yuchung Cheng
Forgot to turn off gmail's rich formatting, so re-sending to the list.
On Fri, Aug 19, 2011 at 3:25 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> On Fri, 19 Aug 2011, Nandita Dukkipati wrote:
>
> > +static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
> > +                                     int fast_rexmit, int flag)
> > +{
> > +     struct tcp_sock *tp = tcp_sk(sk);
> > +     int sndcnt = 0;
> > +     int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> > +
> > +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> > +             if (WARN_ON(!tp->prior_cwnd))
> > +                     tp->prior_cwnd = 1;
>
> This should still be made larger to avoid problems if it ever will be
> needed.
I am letting the value remain at 1, mainly because this is the valid
lowest non-zero value for snd_cwnd to take on. The main purpose of
this code is to catch any lurking bug outside of PRR which results in
an undesirable divide by 0 in PRR. I would like to fix that bug if I
find this code is executed.
>
> > +             sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
> > +                                         tp->snd_ssthresh),
> > +                                   (u64)tp->prior_cwnd) - tp->prr_out;
>
> I think you should pick one from include/linux/math64.h instead of letting
> gcc to do / operand all by itself. ...Obviosly then the ROUND_UP part
> needs to be done manually (either using the remainder or pre-addition
> like DIV_ROUND_UP does).
Done. patch v3 used div_u64.
Thanks
Nandita
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2] Proportional Rate Reduction for TCP.
  2011-08-19 10:26   ` David Miller
@ 2011-08-20  1:29     ` Nandita Dukkipati
  0 siblings, 0 replies; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-20  1:29 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, therbert, mattmathis, ycheng
On Fri, Aug 19, 2011 at 3:26 AM, David Miller <davem@davemloft.net> wrote:
> From: Nandita Dukkipati <nanditad@google.com>
> Date: Fri, 19 Aug 2011 00:33:32 -0700
>
>> @@ -2830,9 +2830,14 @@ static int tcp_try_undo_loss(struct sock *sk)
>>  static inline void tcp_complete_cwr(struct sock *sk)
>>  {
>>       struct tcp_sock *tp = tcp_sk(sk);
>> -     /* Do not moderate cwnd if it's already undone in cwr or recovery */
>> -     if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
>> -             tp->snd_cwnd = tp->snd_ssthresh;
>> +
>> +     /* Do not moderate cwnd if it's already undone in cwr or recovery. */
>> +     if (tp->undo_marker) {
>> +
>> +             if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
>
> Please get rid of that empty line before the TCP_CA_CWR case.
Done.
>
>> +             sndcnt = DIV_ROUND_UP((u64)(tp->prr_delivered *
>> +                                         tp->snd_ssthresh),
>> +                                   (u64)tp->prior_cwnd) - tp->prr_out;
>
> This won't link on 32-bit unless __divdi3 libgcc routine is provided
> by the architecture.  To portably do 64-bit division you need to use
> do_div() or something based upon it.  Perhaps DIV_ROUND_UP_LL() will
> work best in this case.
patch v3 uses div_u64 which is based on do_div() for 32-bit archs.
Thanks
Nandita
^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH v3] Proportional Rate Reduction for TCP.
  2011-08-19  7:33 ` [PATCH v2] " Nandita Dukkipati
  2011-08-19 10:25   ` Ilpo Järvinen
  2011-08-19 10:26   ` David Miller
@ 2011-08-20  1:29   ` Nandita Dukkipati
  2011-08-22  6:21     ` [PATCH v4] " Nandita Dukkipati
  2 siblings, 1 reply; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-20  1:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tom Herbert, Matt Mathis, Yuchung Cheng,
	Nandita Dukkipati
This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast
recovery. PRR avoids excessive window reductions and aims for
the actual congestion window size at the end of recovery to be as
close as possible to the window determined by the congestion control
algorithm. PRR also improves accuracy of the amount of data sent
during loss recovery.
The patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and
replaces the existing rate halving algorithm. PRR improves upon the
existing Linux fast recovery under a number of conditions including:
  1) burst losses where the losses implicitly reduce the amount of
outstanding data (pipe) below the ssthresh value selected by the
congestion control algorithm and,
  2) losses near the end of short flows where application runs out of
data to send.
As an example, with the existing rate halving implementation a single
loss event can cause a connection carrying short Web transactions to
go into the slow start mode after the recovery. This is because during
recovery Linux pulls the congestion window down to packets_in_flight+1
on every ACK. A short Web response often runs out of new data to send
and its pipe reduces to zero by the end of recovery when all its packets
are drained from the network. Subsequent HTTP responses using the same
connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh
by the end of recovery.
A description of PRR and a discussion of its performance can be found at
the following links:
- IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
- IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
- Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
Changelog since v2:
- Used div_u64 in tcp_update_cwnd_in_recovery() to ensure portable 64-bit division. 
- Removed an empty line in tcp_complete_cwr().
Changelog since v1:
- Took care of overflow for large congestion windows in tcp_update_cwnd_in_recovery().
- Renamed prr_cwnd to prior_cwnd.
- Renamed pkts_delivered to newly_acked_sacked.
 include/linux/tcp.h   |    4 +++
 net/ipv4/tcp_input.c  |   61 ++++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_output.c |    7 +++++-
 3 files changed, 65 insertions(+), 7 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 531ede8..6b63b31 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -379,6 +379,10 @@ struct tcp_sock {
 	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
+	u32	prior_cwnd;	/* Congestion window at start of Recovery. */
+	u32	prr_delivered;	/* Number of newly delivered packets to
+				 * receiver in Recovery. */
+	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..40408f1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2830,9 +2830,13 @@ static int tcp_try_undo_loss(struct sock *sk)
 static inline void tcp_complete_cwr(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	/* Do not moderate cwnd if it's already undone in cwr or recovery */
-	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
-		tp->snd_cwnd = tp->snd_ssthresh;
+
+	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
+	if (tp->undo_marker) {
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
+			tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
+		else /* PRR */
+			tp->snd_cwnd = tp->snd_ssthresh;
 		tp->snd_cwnd_stamp = tcp_time_stamp;
 	}
 	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
@@ -2950,6 +2954,41 @@ void tcp_simple_retransmit(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_simple_retransmit);
 
+/* This function implements the PRR algorithm, specifcally the PRR-SSRB
+ * (proportional rate reduction with slow start reduction bound) as described in
+ * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
+ * It computes the number of packets to send (sndcnt) based on packets newly
+ * delivered:
+ *   1) If the packets in flight is larger than ssthresh, PRR spreads the
+ *	cwnd reductions across a full RTT.
+ *   2) If packets in flight is lower than ssthresh (such as due to excess
+ *	losses and/or application stalls), do not perform any further cwnd
+ *	reductions, but instead slow start up to ssthresh.
+ */
+static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
+					int fast_rexmit, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int sndcnt = 0;
+	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
+
+	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
+		u64 dividend = 0;
+		if (WARN_ON(!tp->prior_cwnd))
+			tp->prior_cwnd = 1;
+		dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
+			   tp->prior_cwnd - 1;
+		sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
+	} else {
+		sndcnt = min_t(int, delta,
+			       max_t(int, tp->prr_delivered - tp->prr_out,
+				     newly_acked_sacked) + 1);
+	}
+
+	sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
+	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -2961,7 +3000,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
  * It does _not_ decide what to send, it is made in function
  * tcp_xmit_retransmit_queue().
  */
-static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
+static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
+				  int newly_acked_sacked, int flag)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3111,13 +3151,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 
 		tp->bytes_acked = 0;
 		tp->snd_cwnd_cnt = 0;
+		tp->prior_cwnd = tp->snd_cwnd;
+		tp->prr_delivered = 0;
+		tp->prr_out = 0;
 		tcp_set_ca_state(sk, TCP_CA_Recovery);
 		fast_rexmit = 1;
 	}
 
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
-	tcp_cwnd_down(sk, flag);
+	tp->prr_delivered += newly_acked_sacked;
+	tcp_update_cwnd_in_recovery(sk, newly_acked_sacked, fast_rexmit, flag);
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -3632,6 +3676,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	u32 prior_in_flight;
 	u32 prior_fackets;
 	int prior_packets;
+	int prior_sacked = tp->sacked_out;
+	int newly_acked_sacked = 0;
 	int frto_cwnd = 0;
 
 	/* If the ack is older than previous acks
@@ -3703,6 +3749,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
 
+	newly_acked_sacked = (prior_packets - prior_sacked) -
+			     (tp->packets_out - tp->sacked_out);
+
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 	/* Guarantee sacktag reordering detection against wrap-arounds */
@@ -3715,7 +3764,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 		    tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
-				      flag);
+				      newly_acked_sacked, flag);
 	} else {
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..ca50408 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		tcp_event_new_data_sent(sk, skb);
 
 		tcp_minshall_update(tp, mss_now, skb);
-		sent_pkts++;
+		sent_pkts += tcp_skb_pcount(skb);
 
 		if (push_one)
 			break;
 	}
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+		tp->prr_out += sent_pkts;
 
 	if (likely(sent_pkts)) {
 		tcp_cwnd_validate(sk);
@@ -2294,6 +2296,9 @@ begin_fwd:
 			return;
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
 
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+			tp->prr_out += tcp_skb_pcount(skb);
+
 		if (skb == tcp_write_queue_head(sk))
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 						  inet_csk(sk)->icsk_rto,
-- 
1.7.3.1
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH v2] Proportional Rate Reduction for TCP.
  2011-08-20  1:28     ` Nandita Dukkipati
@ 2011-08-20 12:41       ` Ilpo Järvinen
  2011-08-22  6:21         ` Nandita Dukkipati
  0 siblings, 1 reply; 15+ messages in thread
From: Ilpo Järvinen @ 2011-08-20 12:41 UTC (permalink / raw)
  To: Nandita Dukkipati
  Cc: David S. Miller, Netdev, Tom Herbert, Matt Mathis, Yuchung Cheng
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1430 bytes --]
On Fri, 19 Aug 2011, Nandita Dukkipati wrote:
> Forgot to turn off gmail's rich formatting, so re-sending to the list.
> 
> On Fri, Aug 19, 2011 at 3:25 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > On Fri, 19 Aug 2011, Nandita Dukkipati wrote:
> >
> > > +static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
> > > +                                     int fast_rexmit, int flag)
> > > +{
> > > +     struct tcp_sock *tp = tcp_sk(sk);
> > > +     int sndcnt = 0;
> > > +     int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
> > > +
> > > +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
> > > +             if (WARN_ON(!tp->prior_cwnd))
> > > +                     tp->prior_cwnd = 1;
> >
> > This should still be made larger to avoid problems if it ever will be
> > needed.
> 
> I am letting the value remain at 1, mainly because this is the valid
> lowest non-zero value for snd_cwnd to take on. The main purpose of
> this code is to catch any lurking bug outside of PRR which results in
> an undesirable divide by 0 in PRR. I would like to fix that bug if I
> find this code is executed.
NACK, until this value is at least 2 * tp->snd_ssthresh. Or alternatively 
the fallback is removed so that we DBZ and do not end up wrecking the 
network.
Other than that I'm ok with the patch (assuming the branches I brought
up earlier is ok for everybody else).
-- 
 i.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [PATCH v2] Proportional Rate Reduction for TCP.
  2011-08-20 12:41       ` Ilpo Järvinen
@ 2011-08-22  6:21         ` Nandita Dukkipati
  0 siblings, 0 replies; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-22  6:21 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David S. Miller, Netdev, Tom Herbert, Matt Mathis, Yuchung Cheng
On Sat, Aug 20, 2011 at 5:41 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Fri, 19 Aug 2011, Nandita Dukkipati wrote:
>
>> Forgot to turn off gmail's rich formatting, so re-sending to the list.
>>
>> On Fri, Aug 19, 2011 at 3:25 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> >
>> > On Fri, 19 Aug 2011, Nandita Dukkipati wrote:
>> >
>> > > +static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
>> > > +                                     int fast_rexmit, int flag)
>> > > +{
>> > > +     struct tcp_sock *tp = tcp_sk(sk);
>> > > +     int sndcnt = 0;
>> > > +     int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
>> > > +
>> > > +     if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
>> > > +             if (WARN_ON(!tp->prior_cwnd))
>> > > +                     tp->prior_cwnd = 1;
>> >
>> > This should still be made larger to avoid problems if it ever will be
>> > needed.
>>
>> I am letting the value remain at 1, mainly because this is the valid
>> lowest non-zero value for snd_cwnd to take on. The main purpose of
>> this code is to catch any lurking bug outside of PRR which results in
>> an undesirable divide by 0 in PRR. I would like to fix that bug if I
>> find this code is executed.
>
> NACK, until this value is at least 2 * tp->snd_ssthresh. Or alternatively
> the fallback is removed so that we DBZ and do not end up wrecking the
> network.
If prior_cwnd is suspect at this point, snd_ssthresh cannot be trusted either.
So it appears to me that it's best to remove the fallback and spare
the network.
Done in patch v4.
Nandita
>
> Other than that I'm ok with the patch (assuming the branches I brought
> up earlier is ok for everybody else).
>
>
> --
>  i.
^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH v4] Proportional Rate Reduction for TCP.
  2011-08-20  1:29   ` [PATCH v3] " Nandita Dukkipati
@ 2011-08-22  6:21     ` Nandita Dukkipati
  2011-08-25  2:43       ` David Miller
  0 siblings, 1 reply; 15+ messages in thread
From: Nandita Dukkipati @ 2011-08-22  6:21 UTC (permalink / raw)
  To: David S. Miller
  Cc: netdev, Tom Herbert, Matt Mathis, Yuchung Cheng,
	Nandita Dukkipati
This patch implements Proportional Rate Reduction (PRR) for TCP.
PRR is an algorithm that determines TCP's sending rate in fast
recovery. PRR avoids excessive window reductions and aims for
the actual congestion window size at the end of recovery to be as
close as possible to the window determined by the congestion control
algorithm. PRR also improves accuracy of the amount of data sent
during loss recovery.
The patch implements the recommended flavor of PRR called PRR-SSRB
(Proportional rate reduction with slow start reduction bound) and
replaces the existing rate halving algorithm. PRR improves upon the
existing Linux fast recovery under a number of conditions including:
  1) burst losses where the losses implicitly reduce the amount of
outstanding data (pipe) below the ssthresh value selected by the
congestion control algorithm and,
  2) losses near the end of short flows where application runs out of
data to send.
As an example, with the existing rate halving implementation a single
loss event can cause a connection carrying short Web transactions to
go into the slow start mode after the recovery. This is because during
recovery Linux pulls the congestion window down to packets_in_flight+1
on every ACK. A short Web response often runs out of new data to send
and its pipe reduces to zero by the end of recovery when all its packets
are drained from the network. Subsequent HTTP responses using the same
connection will have to slow start to raise cwnd to ssthresh. PRR on
the other hand aims for the cwnd to be as close as possible to ssthresh
by the end of recovery.
A description of PRR and a discussion of its performance can be found at
the following links:
- IETF Draft:
    http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
- IETF Slides:
    http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
    http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
- Paper to appear in Internet Measurements Conference (IMC) 2011:
    Improving TCP Loss Recovery
    Nandita Dukkipati, Matt Mathis, Yuchung Cheng
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
---
Changelog since v3:
- Removed the fallback of cwnd to 1 in tcp_update_cwnd_recovery(), for the extremely
  unlikely event that tp->prior_cwnd takes on a value of 0 (bug outside of PRR).
Changelog since v2:
- Used div_u64 in tcp_update_cwnd_in_recovery() to ensure portable 64-bit division.
- Removed an empty line in tcp_complete_cwr().
Changelog since v1:
- Took care of overflow for large congestion windows in tcp_update_cwnd_in_recovery().
- Renamed prr_cwnd to prior_cwnd.
- Renamed pkts_delivered to newly_acked_sacked.
 include/linux/tcp.h   |    4 +++
 net/ipv4/tcp_input.c  |   58 +++++++++++++++++++++++++++++++++++++++++++-----
 net/ipv4/tcp_output.c |    7 +++++-
 3 files changed, 62 insertions(+), 7 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 531ede8..6b63b31 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -379,6 +379,10 @@ struct tcp_sock {
 	u32	snd_cwnd_clamp; /* Do not allow snd_cwnd to grow above this */
 	u32	snd_cwnd_used;
 	u32	snd_cwnd_stamp;
+	u32	prior_cwnd;	/* Congestion window at start of Recovery. */
+	u32	prr_delivered;	/* Number of newly delivered packets to
+				 * receiver in Recovery. */
+	u32	prr_out;	/* Total number of pkts sent during Recovery. */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
 	u32	write_seq;	/* Tail(+1) of data held in tcp send buffer */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ea0d218..385c470 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2830,9 +2830,13 @@ static int tcp_try_undo_loss(struct sock *sk)
 static inline void tcp_complete_cwr(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	/* Do not moderate cwnd if it's already undone in cwr or recovery */
-	if (tp->undo_marker && tp->snd_cwnd > tp->snd_ssthresh) {
-		tp->snd_cwnd = tp->snd_ssthresh;
+
+	/* Do not moderate cwnd if it's already undone in cwr or recovery. */
+	if (tp->undo_marker) {
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR)
+			tp->snd_cwnd = min(tp->snd_cwnd, tp->snd_ssthresh);
+		else /* PRR */
+			tp->snd_cwnd = tp->snd_ssthresh;
 		tp->snd_cwnd_stamp = tcp_time_stamp;
 	}
 	tcp_ca_event(sk, CA_EVENT_COMPLETE_CWR);
@@ -2950,6 +2954,38 @@ void tcp_simple_retransmit(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_simple_retransmit);
 
+/* This function implements the PRR algorithm, specifcally the PRR-SSRB
+ * (proportional rate reduction with slow start reduction bound) as described in
+ * http://www.ietf.org/id/draft-mathis-tcpm-proportional-rate-reduction-01.txt.
+ * It computes the number of packets to send (sndcnt) based on packets newly
+ * delivered:
+ *   1) If the packets in flight is larger than ssthresh, PRR spreads the
+ *	cwnd reductions across a full RTT.
+ *   2) If packets in flight is lower than ssthresh (such as due to excess
+ *	losses and/or application stalls), do not perform any further cwnd
+ *	reductions, but instead slow start up to ssthresh.
+ */
+static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
+					int fast_rexmit, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	int sndcnt = 0;
+	int delta = tp->snd_ssthresh - tcp_packets_in_flight(tp);
+
+	if (tcp_packets_in_flight(tp) > tp->snd_ssthresh) {
+		u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
+			       tp->prior_cwnd - 1;
+		sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
+	} else {
+		sndcnt = min_t(int, delta,
+			       max_t(int, tp->prr_delivered - tp->prr_out,
+				     newly_acked_sacked) + 1);
+	}
+
+	sndcnt = max(sndcnt, (fast_rexmit ? 1 : 0));
+	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+}
+
 /* Process an event, which can update packets-in-flight not trivially.
  * Main goal of this function is to calculate new estimate for left_out,
  * taking into account both packets sitting in receiver's buffer and
@@ -2961,7 +2997,8 @@ EXPORT_SYMBOL(tcp_simple_retransmit);
  * It does _not_ decide what to send, it is made in function
  * tcp_xmit_retransmit_queue().
  */
-static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
+static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
+				  int newly_acked_sacked, int flag)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3111,13 +3148,17 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 
 		tp->bytes_acked = 0;
 		tp->snd_cwnd_cnt = 0;
+		tp->prior_cwnd = tp->snd_cwnd;
+		tp->prr_delivered = 0;
+		tp->prr_out = 0;
 		tcp_set_ca_state(sk, TCP_CA_Recovery);
 		fast_rexmit = 1;
 	}
 
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
-	tcp_cwnd_down(sk, flag);
+	tp->prr_delivered += newly_acked_sacked;
+	tcp_update_cwnd_in_recovery(sk, newly_acked_sacked, fast_rexmit, flag);
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -3632,6 +3673,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	u32 prior_in_flight;
 	u32 prior_fackets;
 	int prior_packets;
+	int prior_sacked = tp->sacked_out;
+	int newly_acked_sacked = 0;
 	int frto_cwnd = 0;
 
 	/* If the ack is older than previous acks
@@ -3703,6 +3746,9 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
 
+	newly_acked_sacked = (prior_packets - prior_sacked) -
+			     (tp->packets_out - tp->sacked_out);
+
 	if (tp->frto_counter)
 		frto_cwnd = tcp_process_frto(sk, flag);
 	/* Guarantee sacktag reordering detection against wrap-arounds */
@@ -3715,7 +3761,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 		    tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack, prior_in_flight);
 		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
-				      flag);
+				      newly_acked_sacked, flag);
 	} else {
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 882e0b0..ca50408 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1796,11 +1796,13 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		tcp_event_new_data_sent(sk, skb);
 
 		tcp_minshall_update(tp, mss_now, skb);
-		sent_pkts++;
+		sent_pkts += tcp_skb_pcount(skb);
 
 		if (push_one)
 			break;
 	}
+	if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+		tp->prr_out += sent_pkts;
 
 	if (likely(sent_pkts)) {
 		tcp_cwnd_validate(sk);
@@ -2294,6 +2296,9 @@ begin_fwd:
 			return;
 		NET_INC_STATS_BH(sock_net(sk), mib_idx);
 
+		if (inet_csk(sk)->icsk_ca_state == TCP_CA_Recovery)
+			tp->prr_out += tcp_skb_pcount(skb);
+
 		if (skb == tcp_write_queue_head(sk))
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 						  inet_csk(sk)->icsk_rto,
-- 
1.7.3.1
^ permalink raw reply related	[flat|nested] 15+ messages in thread
* Re: [PATCH v4] Proportional Rate Reduction for TCP.
  2011-08-22  6:21     ` [PATCH v4] " Nandita Dukkipati
@ 2011-08-25  2:43       ` David Miller
  0 siblings, 0 replies; 15+ messages in thread
From: David Miller @ 2011-08-25  2:43 UTC (permalink / raw)
  To: nanditad; +Cc: netdev, therbert, mattmathis, ycheng
From: Nandita Dukkipati <nanditad@google.com>
Date: Sun, 21 Aug 2011 23:21:57 -0700
> This patch implements Proportional Rate Reduction (PRR) for TCP.
> PRR is an algorithm that determines TCP's sending rate in fast
> recovery. PRR avoids excessive window reductions and aims for
> the actual congestion window size at the end of recovery to be as
> close as possible to the window determined by the congestion control
> algorithm. PRR also improves accuracy of the amount of data sent
> during loss recovery.
> 
> The patch implements the recommended flavor of PRR called PRR-SSRB
> (Proportional rate reduction with slow start reduction bound) and
> replaces the existing rate halving algorithm. PRR improves upon the
> existing Linux fast recovery under a number of conditions including:
>   1) burst losses where the losses implicitly reduce the amount of
> outstanding data (pipe) below the ssthresh value selected by the
> congestion control algorithm and,
>   2) losses near the end of short flows where application runs out of
> data to send.
> 
> As an example, with the existing rate halving implementation a single
> loss event can cause a connection carrying short Web transactions to
> go into the slow start mode after the recovery. This is because during
> recovery Linux pulls the congestion window down to packets_in_flight+1
> on every ACK. A short Web response often runs out of new data to send
> and its pipe reduces to zero by the end of recovery when all its packets
> are drained from the network. Subsequent HTTP responses using the same
> connection will have to slow start to raise cwnd to ssthresh. PRR on
> the other hand aims for the cwnd to be as close as possible to ssthresh
> by the end of recovery.
> 
> A description of PRR and a discussion of its performance can be found at
> the following links:
> - IETF Draft:
>     http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
> - IETF Slides:
>     http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
>     http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
> - Paper to appear in Internet Measurements Conference (IMC) 2011:
>     Improving TCP Loss Recovery
>     Nandita Dukkipati, Matt Mathis, Yuchung Cheng
> 
> Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Applied.
^ permalink raw reply	[flat|nested] 15+ messages in thread
end of thread, other threads:[~2011-08-25  2:43 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-12  7:29 [PATCH] Proportional Rate Reduction for TCP Nandita Dukkipati
2011-08-12 11:36 ` Ilpo Järvinen
2011-08-19  7:34   ` Nandita Dukkipati
2011-08-14  5:05 ` Andi Kleen
2011-08-19  7:34   ` Nandita Dukkipati
2011-08-19  7:33 ` [PATCH v2] " Nandita Dukkipati
2011-08-19 10:25   ` Ilpo Järvinen
2011-08-20  1:28     ` Nandita Dukkipati
2011-08-20 12:41       ` Ilpo Järvinen
2011-08-22  6:21         ` Nandita Dukkipati
2011-08-19 10:26   ` David Miller
2011-08-20  1:29     ` Nandita Dukkipati
2011-08-20  1:29   ` [PATCH v3] " Nandita Dukkipati
2011-08-22  6:21     ` [PATCH v4] " Nandita Dukkipati
2011-08-25  2:43       ` David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).