[PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert()
@ 2011-11-16 18:58 Neal Cardwell
  2011-11-16 18:58 ` [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0 Neal Cardwell
                   ` (4 more replies)
  0 siblings, 5 replies; 30+ messages in thread
From: Neal Cardwell @ 2011-11-16 18:58 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ilpo.jarvinen, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert, Neal Cardwell

Allow callers to decide whether an ACK is a duplicate ACK. This is a
prerequisite to allowing fastretrans_alert to be called from new
contexts, such as the no_queue and old_ack code paths, from which we
have extra info that tells us whether an ACK is a dupack.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_input.c |   12 ++++++++----
 1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 52b5c2d..f772aaa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3009,11 +3009,11 @@ static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked,
  * tcp_xmit_retransmit_queue().
  */
 static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
-				  int newly_acked_sacked, int flag)
+				  int newly_acked_sacked, bool is_dupack,
+				  int flag)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	int is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
 	int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
 				    (tcp_fackets_out(tp) > tp->reordering));
 	int fast_rexmit = 0, mib_idx;
@@ -3681,10 +3681,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	u32 prior_snd_una = tp->snd_una;
 	u32 ack_seq = TCP_SKB_CB(skb)->seq;
 	u32 ack = TCP_SKB_CB(skb)->ack_seq;
+	bool is_dupack = false;
 	u32 prior_in_flight;
 	u32 prior_fackets;
 	int prior_packets;
 	int prior_sacked = tp->sacked_out;
+	int pkts_acked = 0;
 	int newly_acked_sacked = 0;
 	int frto_cwnd = 0;
 
@@ -3757,6 +3759,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	/* See if we can take anything off of the retransmit queue. */
 	flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
 
+	pkts_acked = prior_packets - tp->packets_out;
 	newly_acked_sacked = (prior_packets - prior_sacked) -
 			     (tp->packets_out - tp->sacked_out);
 
@@ -3771,8 +3774,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
 		    tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack, prior_in_flight);
-		tcp_fastretrans_alert(sk, prior_packets - tp->packets_out,
-				      newly_acked_sacked, flag);
+		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
+		tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
+				      is_dupack, flag);
 	} else {
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
 			tcp_cong_avoid(sk, ack, prior_in_flight);
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0
  2011-11-16 18:58 [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Neal Cardwell
@ 2011-11-16 18:58 ` Neal Cardwell
  2011-11-17  1:34   ` Ilpo Järvinen
  2011-11-16 18:58 ` [PATCH 3/5] tcp: use SACKs and DSACKs that arrive on ACKs below snd_una Neal Cardwell
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-11-16 18:58 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ilpo.jarvinen, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert, Neal Cardwell

The bug: Senders ignored DSACKs after recovery when there were no
outstanding packets (a common scenario for HTTP servers).

The change: when there are no outstanding packets (the "no_queue" goto
label), call tcp_fastretrans_alert() in order to use DSACKs to undo
congestion window reductions.

Other patches in this series will provide other changes that are
necessary to fully fix this problem.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_input.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f772aaa..b49e418 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3788,6 +3788,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	return 1;
 
 no_queue:
+	/* If data was DSACKed, see if we can undo a cwnd reduction. */
+	if (flag & FLAG_DSACKING_ACK)
+		tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
+				      is_dupack, flag);
 	/* If this ack opens up a zero window, clear backoff.  It was
 	 * being used to time the probes, and is probably far higher than
 	 * it needs to be for normal retransmission.
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/5] tcp: use SACKs and DSACKs that arrive on ACKs below snd_una
  2011-11-16 18:58 [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Neal Cardwell
  2011-11-16 18:58 ` [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0 Neal Cardwell
@ 2011-11-16 18:58 ` Neal Cardwell
  2011-11-17  1:52   ` Ilpo Järvinen
  2011-11-16 18:58 ` [PATCH 4/5] tcp: allow undo from reordered DSACKs Neal Cardwell
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-11-16 18:58 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ilpo.jarvinen, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert, Neal Cardwell

The bug: When the ACK field is below snd_una (which can happen when
ACKs are reordered), senders ignored DSACKs (preventing undo) and did
not call tcp_fastretrans_alert, so they did not increment
prr_delivered to reflect newly-SACKed sequence ranges, and did not
call tcp_xmit_retransmit_queue, thus passing up chances to send out
more retransmitted and new packets based on any newly-SACKed packets.

The change: When the ACK field is below snd_una (the "old_ack" goto
label), call tcp_fastretrans_alert to allow undo based on any
newly-arrived DSACKs and try to send out more packets based on
newly-SACKed packets.

Other patches in this series will provide other changes that are
necessary to fully fix this problem.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_input.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b49e418..751d390 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3805,10 +3805,14 @@ invalid_ack:
 	return -1;
 
 old_ack:
+	/* If data was SACKed, tag it and see if we should send more data.
+	 * If data was DSACKed, see if we can undo a cwnd reduction.
+	 */
 	if (TCP_SKB_CB(skb)->sacked) {
-		tcp_sacktag_write_queue(sk, skb, prior_snd_una);
-		if (icsk->icsk_ca_state == TCP_CA_Open)
-			tcp_try_keep_open(sk);
+		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una);
+		newly_acked_sacked = tp->sacked_out - prior_sacked;
+		tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
+				      is_dupack, flag);
 	}
 
 	SOCK_DEBUG(sk, "Ack %u before %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/5] tcp: allow undo from reordered DSACKs
  2011-11-16 18:58 [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Neal Cardwell
  2011-11-16 18:58 ` [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0 Neal Cardwell
  2011-11-16 18:58 ` [PATCH 3/5] tcp: use SACKs and DSACKs that arrive on ACKs below snd_una Neal Cardwell
@ 2011-11-16 18:58 ` Neal Cardwell
  2011-11-17  5:18   ` Ilpo Järvinen
  2011-11-16 18:58 ` [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open Neal Cardwell
  2011-11-17  1:23 ` [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Ilpo Järvinen
  4 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-11-16 18:58 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ilpo.jarvinen, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert, Neal Cardwell

Previously, SACK-enabled connections hung around in TCP_CA_Disorder
state while snd_una==high_seq, just waiting to accumulate DSACKs and
hopefully undo a cwnd reduction. This could and did lead to the
following unfortunate scenario: if some incoming ACKs advance snd_una
beyond high_seq then we were setting undo_marker to 0 and moving to
TCP_CA_Open, so if (due to reordering in the ACK return path) we
shortly thereafter received a DSACK then we were no longer able to
undo the cwnd reduction.

The change: Simplify the congestion avoidance state machine by
removing the behavior where SACK-enabled connections hung around in
the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
snd_una advances to high_seq or beyond we typically move to
TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
TCP_CA_Disorder if we later receive enough DSACKs.

Other patches in this series will provide other changes that are
necessary to fully fix this problem.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_input.c |   15 ++-------------
 1 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 751d390..a4efdd7 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	int state = TCP_CA_Open;
 
-	if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
+	if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
 		state = TCP_CA_Disorder;
 
 	if (inet_csk(sk)->icsk_ca_state != state) {
@@ -3066,17 +3066,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 			}
 			break;
 
-		case TCP_CA_Disorder:
-			tcp_try_undo_dsack(sk);
-			if (!tp->undo_marker ||
-			    /* For SACK case do not Open to allow to undo
-			     * catching for all duplicate ACKs. */
-			    tcp_is_reno(tp) || tp->snd_una != tp->high_seq) {
-				tp->undo_marker = 0;
-				tcp_set_ca_state(sk, TCP_CA_Open);
-			}
-			break;
-
 		case TCP_CA_Recovery:
 			if (tcp_is_reno(tp))
 				tcp_reset_reno_sack(tp);
@@ -3117,7 +3106,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 				tcp_add_reno_sack(sk);
 		}
 
-		if (icsk->icsk_ca_state == TCP_CA_Disorder)
+		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
 			tcp_try_undo_dsack(sk);
 
 		if (!tcp_time_to_recover(sk)) {
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-16 18:58 [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Neal Cardwell
                   ` (2 preceding siblings ...)
  2011-11-16 18:58 ` [PATCH 4/5] tcp: allow undo from reordered DSACKs Neal Cardwell
@ 2011-11-16 18:58 ` Neal Cardwell
  2011-11-17  5:14   ` Ilpo Järvinen
  2011-11-17  1:23 ` [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Ilpo Järvinen
  4 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-11-16 18:58 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, ilpo.jarvinen, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert, Neal Cardwell

The problem: Senders were overriding cwnd values picked during an undo
by calling tcp_moderate_cwnd() in tcp_try_to_open().

The fix: Don't moderate cwnd in tcp_try_to_open() if we're in
TCP_CA_Open, since doing so is generally unnecessary and specifically
would override a DSACK-based undo of a cwnd reduction made in fast
recovery.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp_input.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a4efdd7..78dd38c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2881,7 +2881,8 @@ static void tcp_try_to_open(struct sock *sk, int flag)
 
 	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
 		tcp_try_keep_open(sk);
-		tcp_moderate_cwnd(tp);
+		if (inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
+			tcp_moderate_cwnd(tp);
 	} else {
 		tcp_cwnd_down(sk, flag);
 	}
-- 
1.7.3.1

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert()
  2011-11-16 18:58 [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Neal Cardwell
                   ` (3 preceding siblings ...)
  2011-11-16 18:58 ` [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open Neal Cardwell
@ 2011-11-17  1:23 ` Ilpo Järvinen
  2011-11-27 23:55   ` David Miller
  4 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-17  1:23 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Ilpo Järvinen, Nandita Dukkipati,
	Yuchung Cheng, Tom Herbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 430 bytes --]

On Wed, 16 Nov 2011, Neal Cardwell wrote:

> Allow callers to decide whether an ACK is a duplicate ACK. This is a
> prerequisite to allowing fastretrans_alert to be called from new
> contexts, such as the no_queue and old_ack code paths, from which we
> have extra info that tells us whether an ACK is a dupack.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>

Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0
  2011-11-16 18:58 ` [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0 Neal Cardwell
@ 2011-11-17  1:34   ` Ilpo Järvinen
  2011-11-27 23:55     ` David Miller
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-17  1:34 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1345 bytes --]

On Wed, 16 Nov 2011, Neal Cardwell wrote:

> The bug: Senders ignored DSACKs after recovery when there were no
> outstanding packets (a common scenario for HTTP servers).
> 
> The change: when there are no outstanding packets (the "no_queue" goto
> label), call tcp_fastretrans_alert() in order to use DSACKs to undo
> congestion window reductions.
> 
> Other patches in this series will provide other changes that are
> necessary to fully fix this problem.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp_input.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index f772aaa..b49e418 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3788,6 +3788,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
>  	return 1;
>  
>  no_queue:
> +	/* If data was DSACKed, see if we can undo a cwnd reduction. */
> +	if (flag & FLAG_DSACKING_ACK)
> +		tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
> +				      is_dupack, flag);

Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

...however, there could be other undo related gotchas now added by this 
patch due to complexity of tcp_fastretrans_alert. It would probably be 
safer to just add the dsack undo related code here?

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] tcp: use SACKs and DSACKs that arrive on ACKs below snd_una
  2011-11-16 18:58 ` [PATCH 3/5] tcp: use SACKs and DSACKs that arrive on ACKs below snd_una Neal Cardwell
@ 2011-11-17  1:52   ` Ilpo Järvinen
  2011-11-27 23:57     ` David Miller
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-17  1:52 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2105 bytes --]

On Wed, 16 Nov 2011, Neal Cardwell wrote:

> The bug: When the ACK field is below snd_una (which can happen when
> ACKs are reordered), senders ignored DSACKs (preventing undo) and did
> not call tcp_fastretrans_alert, so they did not increment
> prr_delivered to reflect newly-SACKed sequence ranges, and did not
> call tcp_xmit_retransmit_queue, thus passing up chances to send out
> more retransmitted and new packets based on any newly-SACKed packets.
> 
> The change: When the ACK field is below snd_una (the "old_ack" goto
> label), call tcp_fastretrans_alert to allow undo based on any
> newly-arrived DSACKs and try to send out more packets based on
> newly-SACKed packets.
> 
> Other patches in this series will provide other changes that are
> necessary to fully fix this problem.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp_input.c |   10 +++++++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index b49e418..751d390 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3805,10 +3805,14 @@ invalid_ack:
>  	return -1;
>  
>  old_ack:
> +	/* If data was SACKed, tag it and see if we should send more data.
> +	 * If data was DSACKed, see if we can undo a cwnd reduction.
> +	 */
>  	if (TCP_SKB_CB(skb)->sacked) {
> -		tcp_sacktag_write_queue(sk, skb, prior_snd_una);
> -		if (icsk->icsk_ca_state == TCP_CA_Open)
> -			tcp_try_keep_open(sk);
> +		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una);
> +		newly_acked_sacked = tp->sacked_out - prior_sacked;
> +		tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked,
> +				      is_dupack, flag);

I gave FLAG_* some thought and couldn't find something that would go wrong 
here (due to goto not all flag enablers are checked for).

Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

...unrelated to the fix, I realized that FRTO is not fully thought through 
in this old ACK case, also its RFC seems to lack considerations on what to 
do in such case. ...I'll need to think the FRTO stuff a bit more.

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-16 18:58 ` [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open Neal Cardwell
@ 2011-11-17  5:14   ` Ilpo Järvinen
  2011-11-19 21:08     ` Neal Cardwell
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-17  5:14 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

On Wed, 16 Nov 2011, Neal Cardwell wrote:

> The problem: Senders were overriding cwnd values picked during an undo
> by calling tcp_moderate_cwnd() in tcp_try_to_open().

I think it's intentional. Because of receiver lying bandwidth cheats all 
unlimited undos are bit dangerous.

> The fix: Don't moderate cwnd in tcp_try_to_open() if we're in
> TCP_CA_Open, since doing so is generally unnecessary and specifically
> would override a DSACK-based undo of a cwnd reduction made in fast
> recovery.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp_input.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index a4efdd7..78dd38c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2881,7 +2881,8 @@ static void tcp_try_to_open(struct sock *sk, int flag)
>  
>  	if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) {
>  		tcp_try_keep_open(sk);
> -		tcp_moderate_cwnd(tp);
> +		if (inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
> +			tcp_moderate_cwnd(tp);
>  	} else {
>  		tcp_cwnd_down(sk, flag);
>  	}

Wouldn't it be enough if tcp max burst is increased to match IW (iirc we 
had 3 still there as a magic number)?


-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] tcp: allow undo from reordered DSACKs
  2011-11-16 18:58 ` [PATCH 4/5] tcp: allow undo from reordered DSACKs Neal Cardwell
@ 2011-11-17  5:18   ` Ilpo Järvinen
  2011-11-17 20:49     ` Neal Cardwell
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-17  5:18 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

On Wed, 16 Nov 2011, Neal Cardwell wrote:

> Previously, SACK-enabled connections hung around in TCP_CA_Disorder
> state while snd_una==high_seq, just waiting to accumulate DSACKs and
> hopefully undo a cwnd reduction. This could and did lead to the
> following unfortunate scenario: if some incoming ACKs advance snd_una
> beyond high_seq then we were setting undo_marker to 0 and moving to
> TCP_CA_Open, so if (due to reordering in the ACK return path) we
> shortly thereafter received a DSACK then we were no longer able to
> undo the cwnd reduction.
> 
> The change: Simplify the congestion avoidance state machine by
> removing the behavior where SACK-enabled connections hung around in
> the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
> snd_una advances to high_seq or beyond we typically move to
> TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
> TCP_CA_Disorder if we later receive enough DSACKs.
> 
> Other patches in this series will provide other changes that are
> necessary to fully fix this problem.
> 
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> ---
>  net/ipv4/tcp_input.c |   15 ++-------------
>  1 files changed, 2 insertions(+), 13 deletions(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 751d390..a4efdd7 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk)
>  	struct tcp_sock *tp = tcp_sk(sk);
>  	int state = TCP_CA_Open;
>  
> -	if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
> +	if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
>  		state = TCP_CA_Disorder;
>  
>  	if (inet_csk(sk)->icsk_ca_state != state) {
> @@ -3066,17 +3066,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
>  			}
>  			break;
>  
> -		case TCP_CA_Disorder:
> -			tcp_try_undo_dsack(sk);
> -			if (!tp->undo_marker ||
> -			    /* For SACK case do not Open to allow to undo
> -			     * catching for all duplicate ACKs. */
> -			    tcp_is_reno(tp) || tp->snd_una != tp->high_seq) {
> -				tp->undo_marker = 0;
> -				tcp_set_ca_state(sk, TCP_CA_Open);
> -			}
> -			break;
> -
>  		case TCP_CA_Recovery:
>  			if (tcp_is_reno(tp))
>  				tcp_reset_reno_sack(tp);
> @@ -3117,7 +3106,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
>  				tcp_add_reno_sack(sk);
>  		}
>  
> -		if (icsk->icsk_ca_state == TCP_CA_Disorder)
> +		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
>  			tcp_try_undo_dsack(sk);
>  
>  		if (!tcp_time_to_recover(sk)) {

How about extending Disorder state until second cumulative ACK that is 
acking >= high_seq?

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] tcp: allow undo from reordered DSACKs
  2011-11-17  5:18   ` Ilpo Järvinen
@ 2011-11-17 20:49     ` Neal Cardwell
  2011-11-22 12:44       ` Ilpo Järvinen
  0 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-11-17 20:49 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

On Thu, Nov 17, 2011 at 12:18 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Wed, 16 Nov 2011, Neal Cardwell wrote:
>
>> Previously, SACK-enabled connections hung around in TCP_CA_Disorder
>> state while snd_una==high_seq, just waiting to accumulate DSACKs and
>> hopefully undo a cwnd reduction. This could and did lead to the
>> following unfortunate scenario: if some incoming ACKs advance snd_una
>> beyond high_seq then we were setting undo_marker to 0 and moving to
>> TCP_CA_Open, so if (due to reordering in the ACK return path) we
>> shortly thereafter received a DSACK then we were no longer able to
>> undo the cwnd reduction.
>>
>> The change: Simplify the congestion avoidance state machine by
>> removing the behavior where SACK-enabled connections hung around in
>> the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
>> snd_una advances to high_seq or beyond we typically move to
>> TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
>> TCP_CA_Disorder if we later receive enough DSACKs.
>>
>> Other patches in this series will provide other changes that are
>> necessary to fully fix this problem.
>>
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
>> ---
>>  net/ipv4/tcp_input.c |   15 ++-------------
>>  1 files changed, 2 insertions(+), 13 deletions(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index 751d390..a4efdd7 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk)
>>       struct tcp_sock *tp = tcp_sk(sk);
>>       int state = TCP_CA_Open;
>>
>> -     if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
>> +     if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
>>               state = TCP_CA_Disorder;
>>
>>       if (inet_csk(sk)->icsk_ca_state != state) {
>> @@ -3066,17 +3066,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
>>                       }
>>                       break;
>>
>> -             case TCP_CA_Disorder:
>> -                     tcp_try_undo_dsack(sk);
>> -                     if (!tp->undo_marker ||
>> -                         /* For SACK case do not Open to allow to undo
>> -                          * catching for all duplicate ACKs. */
>> -                         tcp_is_reno(tp) || tp->snd_una != tp->high_seq) {
>> -                             tp->undo_marker = 0;
>> -                             tcp_set_ca_state(sk, TCP_CA_Open);
>> -                     }
>> -                     break;
>> -
>>               case TCP_CA_Recovery:
>>                       if (tcp_is_reno(tp))
>>                               tcp_reset_reno_sack(tp);
>> @@ -3117,7 +3106,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
>>                               tcp_add_reno_sack(sk);
>>               }
>>
>> -             if (icsk->icsk_ca_state == TCP_CA_Disorder)
>> +             if (icsk->icsk_ca_state <= TCP_CA_Disorder)
>>                       tcp_try_undo_dsack(sk);
>>
>>               if (!tcp_time_to_recover(sk)) {
>
> How about extending Disorder state until second cumulative ACK that is
> acking >= high_seq?

That would seem to add complexity but only provide a partial solution.

This proposed patch has the virtue of providing a general solution
while simplifying the code a little.

What are your concerns with this patch?

Thanks, Ilpo, for taking a look at this patch sequence,
neal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-17  5:14   ` Ilpo Järvinen
@ 2011-11-19 21:08     ` Neal Cardwell
  2011-11-20  8:38       ` Ilpo Järvinen
  2011-11-20 10:19       ` Alexander Zimmermann
  0 siblings, 2 replies; 30+ messages in thread
From: Neal Cardwell @ 2011-11-19 21:08 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

On Thu, Nov 17, 2011 at 12:14 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:

> I think it's intentional. Because of receiver lying bandwidth cheats all
> unlimited undos are bit dangerous.

If your concern is receivers lying, then there are other easy ways
that a misbehaving receiver can get a sender to send too fast
(e.g. cumulatively ACKing the highest sequence number seen,
irrespective of holes). IMHO it would be a shame to penalize the vast
majority of well-behaved users to plug one potential attack vector
when there are other easy holes that an attacker would use.

> Wouldn't it be enough if tcp max burst is increased to match IW (iirc we
> had 3 still there as a magic number)?

Yes, tcp_max_burst() returns tp->reordering now. Changing it to return
max(tp->reordering, TCP_INIT_CWND) sounds good to me. I think that's
an excellent idea in any case, regardless of the outcome of this undo
discussion.

However, I don't think this is sufficient for request-response
protocols (e.g. RPC) running over long-lived connections over a path
with a large bandwidth-delay product. In such cases, the cwnd will
grow quite large (hundreds of packets). The DSACKs will often arrive
after the entire response is sent, so that cwnd moderation will
typically pull the cwnd down to 3 (or, with your proposal, 10)
packets. IMHO that seems like an unnecessarily steep price to pay.

neal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-19 21:08     ` Neal Cardwell
@ 2011-11-20  8:38       ` Ilpo Järvinen
  2011-11-21 22:57         ` Yuchung Cheng
  2011-11-20 10:19       ` Alexander Zimmermann
  1 sibling, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-20  8:38 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2230 bytes --]

On Sat, 19 Nov 2011, Neal Cardwell wrote:

> On Thu, Nov 17, 2011 at 12:14 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> 
> > I think it's intentional. Because of receiver lying bandwidth cheats all
> > unlimited undos are bit dangerous.
> 
> If your concern is receivers lying, then there are other easy ways
> that a misbehaving receiver can get a sender to send too fast
> (e.g. cumulatively ACKing the highest sequence number seen,
> irrespective of holes). IMHO it would be a shame to penalize the vast
> majority of well-behaved users to plug one potential attack vector
> when there are other easy holes that an attacker would use.

I disagree... there is huge difference as the receiver then has to gamble 
with the reliability of the flow. DSACK attack is nasty in that that it
allows maintaining reliability while cheating bw.

> > Wouldn't it be enough if tcp max burst is increased to match IW (iirc we
> > had 3 still there as a magic number)?
> 
> Yes, tcp_max_burst() returns tp->reordering now. Changing it to return
> max(tp->reordering, TCP_INIT_CWND) sounds good to me. I think that's
> an excellent idea in any case, regardless of the outcome of this undo
> discussion.

Feel free to provide a patch then as I'm likely to forget :).

> However, I don't think this is sufficient for request-response
> protocols (e.g. RPC) running over long-lived connections over a path
> with a large bandwidth-delay product. In such cases, the cwnd will
> grow quite large (hundreds of packets). The DSACKs will often arrive
> after the entire response is sent, so that cwnd moderation will
> typically pull the cwnd down to 3 (or, with your proposal, 10)
> packets. IMHO that seems like an unnecessarily steep price to pay.

Sounds rather hypotetical scenario that you'd happen to get those DSACKs 
in such scenario... DSACK would only be sent when receiver and sender were 
out of sync due to heavy ACK loss that exceeds SACK redundancy or heavy 
reordering... the later prevents large window in the first place (or 
reordering has already adapted). Honestly I don't believe this is likely 
scenario. ...And FRTO handles single unnecesary rexmit in case of spurious 
RTO even before DSACK could be seen.

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-19 21:08     ` Neal Cardwell
  2011-11-20  8:38       ` Ilpo Järvinen
@ 2011-11-20 10:19       ` Alexander Zimmermann
  1 sibling, 0 replies; 30+ messages in thread
From: Alexander Zimmermann @ 2011-11-20 10:19 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: Ilpo Järvinen, David Miller, netdev, Nandita Dukkipati,
	Yuchung Cheng, Tom Herbert, Carsten Wolff

[-- Attachment #1: Type: text/plain, Size: 2637 bytes --]

Hi Neal,

Am 19.11.2011 um 22:08 schrieb Neal Cardwell:

> On Thu, Nov 17, 2011 at 12:14 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> 
>> I think it's intentional. Because of receiver lying bandwidth cheats all
>> unlimited undos are bit dangerous.
> 
> If your concern is receivers lying, then there are other easy ways
> that a misbehaving receiver can get a sender to send too fast
> (e.g. cumulatively ACKing the highest sequence number seen,
> irrespective of holes). IMHO it would be a shame to penalize the vast
> majority of well-behaved users to plug one potential attack vector
> when there are other easy holes that an attacker would use.
> 
>> Wouldn't it be enough if tcp max burst is increased to match IW (iirc we
>> had 3 still there as a magic number)?
> 
> Yes, tcp_max_burst() returns tp->reordering now. Changing it to return
> max(tp->reordering, TCP_INIT_CWND) sounds good to me. I think that's
> an excellent idea in any case, regardless of the outcome of this undo
> discussion.

For my Phd thesis - based on the Linux reordering detection we implemented
an adaptive Version of TCP-NCR (RFC 4653) - we change tcp_max_burst() to
return TCP_INIT_CWND. If the reordering delay is high, tp->reordering is
high as well, allowing to send a huge burst. In 2008 John Heffner* changed
tcp_max_burst() to return tp->reordering according to problem for reordering
on the reverse path. However, I never was be able to reproduce old problem...
Due to my experience I recommend to use TCP_INIT_CWND directly.

(*)http://git.kernel.org/linus/dd9e0dda66ba38a2ddd140 5ac279894260dc5c36.

> 
> However, I don't think this is sufficient for request-response
> protocols (e.g. RPC) running over long-lived connections over a path
> with a large bandwidth-delay product. In such cases, the cwnd will
> grow quite large (hundreds of packets). The DSACKs will often arrive
> after the entire response is sent, so that cwnd moderation will
> typically pull the cwnd down to 3 (or, with your proposal, 10)
> packets. IMHO that seems like an unnecessarily steep price to pay.
> 
> neal
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//


[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 163 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-20  8:38       ` Ilpo Järvinen
@ 2011-11-21 22:57         ` Yuchung Cheng
  2011-11-22 11:47           ` Ilpo Järvinen
  0 siblings, 1 reply; 30+ messages in thread
From: Yuchung Cheng @ 2011-11-21 22:57 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: Neal Cardwell, David Miller, Netdev, Nandita Dukkipati,
	Tom Herbert

On Sun, Nov 20, 2011 at 12:38 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Sat, 19 Nov 2011, Neal Cardwell wrote:
>> If your concern is receivers lying, then there are other easy ways
>> that a misbehaving receiver can get a sender to send too fast
>> (e.g. cumulatively ACKing the highest sequence number seen,
>> irrespective of holes). IMHO it would be a shame to penalize the vast
>> majority of well-behaved users to plug one potential attack vector
>> when there are other easy holes that an attacker would use.
>
> I disagree... there is huge difference as the receiver then has to gamble
> with the reliability of the flow. DSACK attack is nasty in that that it
> allows maintaining reliability while cheating bw.
>

I feel the original change Neal is proposing is getting lost. This
patch proposes when

a. ack is dubious
b. but it's not time to recover yet
c. and other checks in tcp_try_keep_open affirms the network is in
good condition, i.e, stat == Open
3. do not moderate cwnd (not 3, not 10 or any magic number).

Your concern is valid, but if that's true for majority then I argue
all undo logic on dsack or TS are causing major threats since they all
require some trusts of the remote end. Why even bother processing
DSACK if we can't trust them?

AFAIK cwnd moderation is to lower bursty drops not to discourage
(dsack) cheating. I believe the reason is the same in
tcp_try_to_open(). We are in common cases, e.g., loss-recovery, this
logic hurts performance.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-21 22:57         ` Yuchung Cheng
@ 2011-11-22 11:47           ` Ilpo Järvinen
  2011-11-28  0:01             ` David Miller
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-22 11:47 UTC (permalink / raw)
  To: Yuchung Cheng
  Cc: Neal Cardwell, David Miller, Netdev, Nandita Dukkipati,
	Tom Herbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2970 bytes --]

On Mon, 21 Nov 2011, Yuchung Cheng wrote:

> On Sun, Nov 20, 2011 at 12:38 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Sat, 19 Nov 2011, Neal Cardwell wrote:
> >> If your concern is receivers lying, then there are other easy ways
> >> that a misbehaving receiver can get a sender to send too fast
> >> (e.g. cumulatively ACKing the highest sequence number seen,
> >> irrespective of holes). IMHO it would be a shame to penalize the vast
> >> majority of well-behaved users to plug one potential attack vector
> >> when there are other easy holes that an attacker would use.
> >
> > I disagree... there is huge difference as the receiver then has to gamble
> > with the reliability of the flow. DSACK attack is nasty in that that it
> > allows maintaining reliability while cheating bw.
> 
> I feel the original change Neal is proposing is getting lost. This patch 
> proposes when
> 
> a. ack is dubious
> b. but it's not time to recover yet
> c. and other checks in tcp_try_keep_open affirms the network is in
> good condition, i.e, stat == Open

Presence of DSACK certainly does disagree with your condition c (which 
was the purpose of this change? Though it could be present in a later 
segment). Network was in such a bad condition that DSACKs were needed in 
the first place.

> 3. do not moderate cwnd (not 3, not 10 or any magic number).
> 
> Your concern is valid, but if that's true for majority then I argue
> all undo logic on dsack or TS are causing major threats since they all
> require some trusts of the remote end. Why even bother processing
> DSACK if we can't trust them?

Right, it seems that DSACK RFCs (RFC 3708 mainly) don't care too much 
about the cheating possibility, just states that it is possible and 
that a non-available nonce solution would help. So yes, I myself find 
DSACK quite useless, albeit fancy, mechanism (there are some other uses 
not related to undo mentioned though in the RFC but I've not spent too 
much thinking the details).

As regards TS, we don't trust them fully either (see e.g., some CUBIC
related change from Stephen Hemminger couple of years ago).

> AFAIK cwnd moderation is to lower bursty drops not to discourage
> (dsack) cheating. I believe the reason is the same in
> tcp_try_to_open(). We are in common cases, e.g., loss-recovery, this
> logic hurts performance.

Quote from the patch description: "Senders were overriding cwnd values 
picked during an undo by calling tcp_moderate_cwnd()" ...so after all it 
has to do with undo being limited. IMHO only up to orig_cwnd/2+IW is safe 
due to cheating opportunities. Also FRTO uses orig_cwnd/2 due to same 
reason (it could do the +IW too but IIRC it is only /2 currently). What 
would be the safeguard there after this one is removed? I kind of see your 
point about this particular line being related to burst mitigation but on 
the same time the end result of removal is that undo becomes potentially 
much more aggressive.

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] tcp: allow undo from reordered DSACKs
  2011-11-17 20:49     ` Neal Cardwell
@ 2011-11-22 12:44       ` Ilpo Järvinen
  2011-11-27 23:58         ` David Miller
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-22 12:44 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, Netdev, Nandita Dukkipati, Yuchung Cheng,
	Tom Herbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4409 bytes --]

On Thu, 17 Nov 2011, Neal Cardwell wrote:

> On Thu, Nov 17, 2011 at 12:18 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Wed, 16 Nov 2011, Neal Cardwell wrote:
> >
> >> Previously, SACK-enabled connections hung around in TCP_CA_Disorder
> >> state while snd_una==high_seq, just waiting to accumulate DSACKs and
> >> hopefully undo a cwnd reduction. This could and did lead to the
> >> following unfortunate scenario: if some incoming ACKs advance snd_una
> >> beyond high_seq then we were setting undo_marker to 0 and moving to
> >> TCP_CA_Open, so if (due to reordering in the ACK return path) we
> >> shortly thereafter received a DSACK then we were no longer able to
> >> undo the cwnd reduction.
> >>
> >> The change: Simplify the congestion avoidance state machine by
> >> removing the behavior where SACK-enabled connections hung around in
> >> the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
> >> snd_una advances to high_seq or beyond we typically move to
> >> TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
> >> TCP_CA_Disorder if we later receive enough DSACKs.
> >>
> >> Other patches in this series will provide other changes that are
> >> necessary to fully fix this problem.
> >>
> >> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> >> ---
> >>  net/ipv4/tcp_input.c |   15 ++-------------
> >>  1 files changed, 2 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index 751d390..a4efdd7 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk)
> >>       struct tcp_sock *tp = tcp_sk(sk);
> >>       int state = TCP_CA_Open;
> >>
> >> -     if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker)
> >> +     if (tcp_left_out(tp) || tcp_any_retrans_done(sk))
> >>               state = TCP_CA_Disorder;
> >>
> >>       if (inet_csk(sk)->icsk_ca_state != state) {
> >> @@ -3066,17 +3066,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> >>                       }
> >>                       break;
> >>
> >> -             case TCP_CA_Disorder:
> >> -                     tcp_try_undo_dsack(sk);
> >> -                     if (!tp->undo_marker ||
> >> -                         /* For SACK case do not Open to allow to undo
> >> -                          * catching for all duplicate ACKs. */
> >> -                         tcp_is_reno(tp) || tp->snd_una != tp->high_seq) {
> >> -                             tp->undo_marker = 0;
> >> -                             tcp_set_ca_state(sk, TCP_CA_Open);
> >> -                     }
> >> -                     break;
> >> -
> >>               case TCP_CA_Recovery:
> >>                       if (tcp_is_reno(tp))
> >>                               tcp_reset_reno_sack(tp);
> >> @@ -3117,7 +3106,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
> >>                               tcp_add_reno_sack(sk);
> >>               }
> >>
> >> -             if (icsk->icsk_ca_state == TCP_CA_Disorder)
> >> +             if (icsk->icsk_ca_state <= TCP_CA_Disorder)
> >>                       tcp_try_undo_dsack(sk);
> >>
> >>               if (!tcp_time_to_recover(sk)) {
> >
> > How about extending Disorder state until second cumulative ACK that is
> > acking >= high_seq?
> 
> That would seem to add complexity but only provide a partial solution.

Right, I forgot the reordering.

> This proposed patch has the virtue of providing a general solution
> while simplifying the code a little.
>
> What are your concerns with this patch?

...I was thinking that if we go to the direction of removal of more and 
more CA_Disorder stuff we could just as well remove the whole state. But 
I've since that time convinced myself that the state itself is still 
necessary even if less action takes place in it after this patch.

Also, one (serious) word of caution! This change, by its extending of 
CA_Open state, is somewhat similar to what I made FRTO years ago, and 
managed to introduces subtle breaking to the state machine. Please check 
that the problem similar to what was fixed by 
a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in 
some other form causing spurious undos). ...To me it seems that 
tcp_packet_delayed might be similarly confused after the given patch.

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert()
  2011-11-17  1:23 ` [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Ilpo Järvinen
@ 2011-11-27 23:55   ` David Miller
  0 siblings, 0 replies; 30+ messages in thread
From: David Miller @ 2011-11-27 23:55 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: ncardwell, netdev, nanditad, ycheng, therbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 17 Nov 2011 03:23:53 +0200 (EET)

> On Wed, 16 Nov 2011, Neal Cardwell wrote:
> 
>> Allow callers to decide whether an ACK is a duplicate ACK. This is a
>> prerequisite to allowing fastretrans_alert to be called from new
>> contexts, such as the no_queue and old_ack code paths, from which we
>> have extra info that tells us whether an ACK is a dupack.
>> 
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> 
> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied to net-next.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0
  2011-11-17  1:34   ` Ilpo Järvinen
@ 2011-11-27 23:55     ` David Miller
  0 siblings, 0 replies; 30+ messages in thread
From: David Miller @ 2011-11-27 23:55 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: ncardwell, netdev, nanditad, ycheng, therbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 17 Nov 2011 03:34:17 +0200 (EET)

> On Wed, 16 Nov 2011, Neal Cardwell wrote:
> 
>> The bug: Senders ignored DSACKs after recovery when there were no
>> outstanding packets (a common scenario for HTTP servers).
>> 
>> The change: when there are no outstanding packets (the "no_queue" goto
>> label), call tcp_fastretrans_alert() in order to use DSACKs to undo
>> congestion window reductions.
>> 
>> Other patches in this series will provide other changes that are
>> necessary to fully fix this problem.
>> 
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
 ...
> 
> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied to net-next.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] tcp: use SACKs and DSACKs that arrive on ACKs below snd_una
  2011-11-17  1:52   ` Ilpo Järvinen
@ 2011-11-27 23:57     ` David Miller
  0 siblings, 0 replies; 30+ messages in thread
From: David Miller @ 2011-11-27 23:57 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: ncardwell, netdev, nanditad, ycheng, therbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 17 Nov 2011 03:52:37 +0200 (EET)

> On Wed, 16 Nov 2011, Neal Cardwell wrote:
> 
>> The bug: When the ACK field is below snd_una (which can happen when
>> ACKs are reordered), senders ignored DSACKs (preventing undo) and did
>> not call tcp_fastretrans_alert, so they did not increment
>> prr_delivered to reflect newly-SACKed sequence ranges, and did not
>> call tcp_xmit_retransmit_queue, thus passing up chances to send out
>> more retransmitted and new packets based on any newly-SACKed packets.
>> 
>> The change: When the ACK field is below snd_una (the "old_ack" goto
>> label), call tcp_fastretrans_alert to allow undo based on any
>> newly-arrived DSACKs and try to send out more packets based on
>> newly-SACKed packets.
>> 
>> Other patches in this series will provide other changes that are
>> necessary to fully fix this problem.
>> 
>> Signed-off-by: Neal Cardwell <ncardwell@google.com>
 ...
> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied to net-next.

> ...unrelated to the fix, I realized that FRTO is not fully thought through 
> in this old ACK case, also its RFC seems to lack considerations on what to 
> do in such case. ...I'll need to think the FRTO stuff a bit more.

Every feature added to TCP beginning with timestamps has not been well
thought out wrt. reordering. :-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] tcp: allow undo from reordered DSACKs
  2011-11-22 12:44       ` Ilpo Järvinen
@ 2011-11-27 23:58         ` David Miller
  2011-11-28 17:31           ` Neal Cardwell
  0 siblings, 1 reply; 30+ messages in thread
From: David Miller @ 2011-11-27 23:58 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: ncardwell, netdev, nanditad, ycheng, therbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 22 Nov 2011 14:44:32 +0200 (EET)

> On Thu, 17 Nov 2011, Neal Cardwell wrote:
> 
>> On Thu, Nov 17, 2011 at 12:18 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> > On Wed, 16 Nov 2011, Neal Cardwell wrote:
>> >
>> >> Previously, SACK-enabled connections hung around in TCP_CA_Disorder
>> >> state while snd_una==high_seq, just waiting to accumulate DSACKs and
>> >> hopefully undo a cwnd reduction. This could and did lead to the
>> >> following unfortunate scenario: if some incoming ACKs advance snd_una
>> >> beyond high_seq then we were setting undo_marker to 0 and moving to
>> >> TCP_CA_Open, so if (due to reordering in the ACK return path) we
>> >> shortly thereafter received a DSACK then we were no longer able to
>> >> undo the cwnd reduction.
>> >>
>> >> The change: Simplify the congestion avoidance state machine by
>> >> removing the behavior where SACK-enabled connections hung around in
>> >> the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
>> >> snd_una advances to high_seq or beyond we typically move to
>> >> TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
>> >> TCP_CA_Disorder if we later receive enough DSACKs.
>> >>
>> >> Other patches in this series will provide other changes that are
>> >> necessary to fully fix this problem.
>> >>
>> >> Signed-off-by: Neal Cardwell <ncardwell@google.com>
 ...
>> > How about extending Disorder state until second cumulative ACK that is
>> > acking >= high_seq?
>> 
>> That would seem to add complexity but only provide a partial solution.
> 
> Right, I forgot the reordering.

In order to make forward progress I've added Neil's patch to net-next.

> Also, one (serious) word of caution! This change, by its extending of 
> CA_Open state, is somewhat similar to what I made FRTO years ago, and 
> managed to introduces subtle breaking to the state machine. Please check 
> that the problem similar to what was fixed by 
> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in 
> some other form causing spurious undos). ...To me it seems that 
> tcp_packet_delayed might be similarly confused after the given patch.

Neil, please look into this so we can have any such issues fixed
in time for the next merge window.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-22 11:47           ` Ilpo Järvinen
@ 2011-11-28  0:01             ` David Miller
  2011-11-28 17:35               ` Neal Cardwell
  0 siblings, 1 reply; 30+ messages in thread
From: David Miller @ 2011-11-28  0:01 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: ycheng, ncardwell, netdev, nanditad, therbert

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 22 Nov 2011 13:47:13 +0200 (EET)

> On Mon, 21 Nov 2011, Yuchung Cheng wrote:
> 
>> AFAIK cwnd moderation is to lower bursty drops not to discourage
>> (dsack) cheating. I believe the reason is the same in
>> tcp_try_to_open(). We are in common cases, e.g., loss-recovery, this
>> logic hurts performance.
> 
> Quote from the patch description: "Senders were overriding cwnd values 
> picked during an undo by calling tcp_moderate_cwnd()" ...so after all it 
> has to do with undo being limited. IMHO only up to orig_cwnd/2+IW is safe 
> due to cheating opportunities. Also FRTO uses orig_cwnd/2 due to same 
> reason (it could do the +IW too but IIRC it is only /2 currently). What 
> would be the safeguard there after this one is removed? I kind of see your 
> point about this particular line being related to burst mitigation but on 
> the same time the end result of removal is that undo becomes potentially 
> much more aggressive.

I'm apply this patch to net-next now, but Neil and Yucheng are on the hook
to fully look into the issues Ilpo has raised.

Thanks.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] tcp: allow undo from reordered DSACKs
  2011-11-27 23:58         ` David Miller
@ 2011-11-28 17:31           ` Neal Cardwell
  2011-12-12 12:05             ` [PATCH 1/1] tcp: fix spurious undos in CA_Open Ilpo Järvinen
  0 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-11-28 17:31 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, netdev, nanditad, ycheng, therbert

On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
> In order to make forward progress I've added Neil's patch to net-next.

Thanks!

>> Also, one (serious) word of caution! This change, by its extending of
>> CA_Open state, is somewhat similar to what I made FRTO years ago, and
>> managed to introduces subtle breaking to the state machine. Please check
>> that the problem similar to what was fixed by
>> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
>> some other form causing spurious undos). ...To me it seems that
>> tcp_packet_delayed might be similarly confused after the given patch.
>
> Neil, please look into this so we can have any such issues fixed
> in time for the next merge window.

Absolutely. I will look into the areas that Ilpo mentioned.

neal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-28  0:01             ` David Miller
@ 2011-11-28 17:35               ` Neal Cardwell
  2011-11-28 20:33                 ` Yuchung Cheng
  0 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-11-28 17:35 UTC (permalink / raw)
  To: David Miller; +Cc: ilpo.jarvinen, ycheng, netdev, nanditad, therbert

On Sun, Nov 27, 2011 at 7:01 PM, David Miller <davem@davemloft.net> wrote:
> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 22 Nov 2011 13:47:13 +0200 (EET)
>
>> On Mon, 21 Nov 2011, Yuchung Cheng wrote:
>>
>>> AFAIK cwnd moderation is to lower bursty drops not to discourage
>>> (dsack) cheating. I believe the reason is the same in
>>> tcp_try_to_open(). We are in common cases, e.g., loss-recovery, this
>>> logic hurts performance.
>>
>> Quote from the patch description: "Senders were overriding cwnd values
>> picked during an undo by calling tcp_moderate_cwnd()" ...so after all it
>> has to do with undo being limited. IMHO only up to orig_cwnd/2+IW is safe
>> due to cheating opportunities. Also FRTO uses orig_cwnd/2 due to same
>> reason (it could do the +IW too but IIRC it is only /2 currently). What
>> would be the safeguard there after this one is removed? I kind of see your
>> point about this particular line being related to burst mitigation but on
>> the same time the end result of removal is that undo becomes potentially
>> much more aggressive.
>
> I'm apply this patch to net-next now, but Neil and Yucheng are on the hook
> to fully look into the issues Ilpo has raised.

Thanks! We will spend some time looking into these issues.

neal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-28 17:35               ` Neal Cardwell
@ 2011-11-28 20:33                 ` Yuchung Cheng
  2011-11-29 10:33                   ` Ilpo Järvinen
  0 siblings, 1 reply; 30+ messages in thread
From: Yuchung Cheng @ 2011-11-28 20:33 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: David Miller, ilpo.jarvinen, netdev, nanditad, therbert

On Mon, Nov 28, 2011 at 9:35 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Sun, Nov 27, 2011 at 7:01 PM, David Miller <davem@davemloft.net> wrote:
>> I'm apply this patch to net-next now, but Neil and Yucheng are on the hook
>> to fully look into the issues Ilpo has raised.
>
> Thanks! We will spend some time looking into these issues.
>
Ilpo do you know any report on these types of cheating? I did some
literature search but didn't find any.

If the receiver simply responds DSACK on good (non-spurious)
retransmits, the sender may increase dupthresh (tp->reordering) and
slows down triggering fast-recovery? given the connection has real
losses, the receiver may end up shooting his feet.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open
  2011-11-28 20:33                 ` Yuchung Cheng
@ 2011-11-29 10:33                   ` Ilpo Järvinen
  0 siblings, 0 replies; 30+ messages in thread
From: Ilpo Järvinen @ 2011-11-29 10:33 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: Neal Cardwell, David Miller, Netdev, nanditad, therbert

On Mon, 28 Nov 2011, Yuchung Cheng wrote:

> On Mon, Nov 28, 2011 at 9:35 AM, Neal Cardwell <ncardwell@google.com> wrote:
> > On Sun, Nov 27, 2011 at 7:01 PM, David Miller <davem@davemloft.net> wrote:
> >> I'm apply this patch to net-next now, but Neil and Yucheng are on the hook
> >> to fully look into the issues Ilpo has raised.
> >
> > Thanks! We will spend some time looking into these issues.
> >
> Ilpo do you know any report on these types of cheating? I did some
> literature search but didn't find any.

No I know of, however, I'm not too surprised as so far such attacks have 
lacked a target at least among Linux endhosts. It is only possible to gain 
something now after you have one by one removed all safeguards which 
prevent such misuse :-). This possibility is certainly known to exists 
(even RFC3708 mentions it).

Besides, DSACK adoption, not to speak of adoption of any DSACK undo 
algorithm, is not that high as with SACK alone iirc.

> If the receiver simply responds DSACK on good (non-spurious)
> retransmits, the sender may increase dupthresh (tp->reordering) and
> slows down triggering fast-recovery? given the connection has real
> losses, the receiver may end up shooting his feet.

For some attack variants, yes. However, it is always "safe" to attack 
the current Linux sender with something that is past the acknowledgment 
number, there's only dummy counting done for that (trivial to arrange). If 
the attacker plays safe, no tp->reordering effect is ever imposed but 
attack eventually succeeds to hit the right number of undo_retrans no 
matter what if one keeps trying. This is what you just asked us to 
enable, unlimited attack window! Perhaps we could add a penalty of 
misguessing the number of rexmits but knowing the right number is not that 
hard anyway (and some caveats exist in such approach though to those who 
are hit by HW/path duping).

Btw, another thing that should be auditted now that this was applied: 
check that undo_marker checks do not behave bad due to 32-bit seqno 
wrap-arounds since its needs-to-be-valid period got just extented way 
beyond a single loss event.

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/1] tcp: fix spurious undos in CA_Open
  2011-11-28 17:31           ` Neal Cardwell
@ 2011-12-12 12:05             ` Ilpo Järvinen
  2011-12-12 19:40               ` Neal Cardwell
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-12-12 12:05 UTC (permalink / raw)
  To: Neal Cardwell, David Miller; +Cc: Netdev, nanditad, Yuchung Cheng, therbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3838 bytes --]

On Mon, 28 Nov 2011, Neal Cardwell wrote:

> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
>
> >> Also, one (serious) word of caution! This change, by its extending of
> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
> >> managed to introduces subtle breaking to the state machine. Please check
> >> that the problem similar to what was fixed by
> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
> >> some other form causing spurious undos). ...To me it seems that
> >> tcp_packet_delayed might be similarly confused after the given patch.
> >
> > Neil, please look into this so we can have any such issues fixed
> > in time for the next merge window.
> 
> Absolutely. I will look into the areas that Ilpo mentioned.

Unfortunately nothing has happened? So I finally had to come up something 
myself. ...Compile tested only.

--
[PATCH 1/1] tcp: fix spurious undos in CA_Open

There's a landmine in tcp_packet_delayed. In CA_Open the
retrans_stamp is very eagerly cleared which falsely triggers
packet-delayed detection. Therefore this code cannot be used
in CA_Open if the retrans_stamp was already cleared.

This became issue once DSACK detection was extended to happen
in CA_Open state too (f698204bd0b, tcp: allow undo from
reordered DSACKs). Essentially the whole congestion control
is disabled because the undos now always restore the previous
window. I wonder if this was already in production... ...at
least the Internet didn't melt ;-).

Compile tested only.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   20 ++++++++++++++------
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 52b5c2d..7d8934d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2640,8 +2640,14 @@ static void tcp_cwnd_down(struct sock *sk, int flag)
 /* Nothing was retransmitted or returned timestamp is less
  * than timestamp of the first retransmission.
  */
-static inline int tcp_packet_delayed(const struct tcp_sock *tp)
+static inline int tcp_packet_delayed(const struct sock *sk)
 {
+	struct tcp_sock *tp = tcp_sk(sk);
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_ca_state == TCP_CA_Open && !tp->retrans_stamp)
+		return 0;
+
 	return !tp->retrans_stamp ||
 		(tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
 		 before(tp->rx_opt.rcv_tsecr, tp->retrans_stamp));
@@ -2701,9 +2707,11 @@ static void tcp_undo_cwr(struct sock *sk, const bool undo_ssthresh)
 	tp->snd_cwnd_stamp = tcp_time_stamp;
 }
 
-static inline int tcp_may_undo(const struct tcp_sock *tp)
+static inline int tcp_may_undo(const struct sock *sk)
 {
-	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(tp));
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return tp->undo_marker && (!tp->undo_retrans || tcp_packet_delayed(sk));
 }
 
 /* People celebrate: "We love our President!" */
@@ -2711,7 +2719,7 @@ static int tcp_try_undo_recovery(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(sk)) {
 		int mib_idx;
 
 		/* Happy end! We did not retransmit anything
@@ -2788,7 +2796,7 @@ static int tcp_try_undo_partial(struct sock *sk, int acked)
 	/* Partial ACK arrived. Force Hoe's retransmit. */
 	int failed = tcp_is_reno(tp) || (tcp_fackets_out(tp) > tp->reordering);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(sk)) {
 		/* Plain luck! Hole if filled with delayed
 		 * packet, rather than with a retransmit.
 		 */
@@ -2815,7 +2823,7 @@ static int tcp_try_undo_loss(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (tcp_may_undo(tp)) {
+	if (tcp_may_undo(sk)) {
 		struct sk_buff *skb;
 		tcp_for_write_queue(skb, sk) {
 			if (skb == tcp_send_head(sk))
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/1] tcp: fix spurious undos in CA_Open
  2011-12-12 12:05             ` [PATCH 1/1] tcp: fix spurious undos in CA_Open Ilpo Järvinen
@ 2011-12-12 19:40               ` Neal Cardwell
  2011-12-14  8:57                 ` Ilpo Järvinen
  0 siblings, 1 reply; 30+ messages in thread
From: Neal Cardwell @ 2011-12-12 19:40 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, Netdev, nanditad, Yuchung Cheng, therbert

On Mon, Dec 12, 2011 at 7:05 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Mon, 28 Nov 2011, Neal Cardwell wrote:
>
>> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
>>
>> >> Also, one (serious) word of caution! This change, by its extending of
>> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
>> >> managed to introduces subtle breaking to the state machine. Please check
>> >> that the problem similar to what was fixed by
>> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
>> >> some other form causing spurious undos). ...To me it seems that
>> >> tcp_packet_delayed might be similarly confused after the given patch.
>> >
>> > Neil, please look into this so we can have any such issues fixed
>> > in time for the next merge window.
>>
>> Absolutely. I will look into the areas that Ilpo mentioned.
>
> Unfortunately nothing has happened? So I finally had to come up something
> myself. ...Compile tested only.

Sorry for the radio silence, but I have been looking at this. So far
it's been a time-consuming process, as we've uncovered a number of
pre-existing issues in tcp_packet_delayed that seem like quite serious
bugs:

1) tcp_packet_delayed implicitly assumes, but does not verify, that
the ACK in question (whose ECR field precedes retrans_stamp) covers
all retransmitted segments. So if we retransmit segments 1 and 2, and
then get an ACK for 1 that has an ECR indicating it was for the
original copy of segment 1, then we undo,  even in cases where segment
2 really was lost and needed to be retransmitted.

2) tcp_packet_delayed implicitly assumes that if we retransmit a
packet that was truly lost then any ACK for the retransmitted instance
of that packet will echo the timestamp field in the retransmitted
packet. But if the receiver is implementing delayed ACKs and correctly
implements timestamps per RFC 1323, then the receiver will be obeying
this clause: "Thus, when delayed ACKs are in use, the receiver should
reply with the TSval field from the earliest unacknowledged segment."
Thus tcp_packet_delayed can cause spurious undos in the following
circumstances:

- send segment 1 with tsval 10
- receiver is doing delayed ACKs and does not ACK 1 but sets TS.Recent to 10
- send segment 2 with tsval 20
- segment 2 is lost in transit
- RTO, retransmit segment 2 with tsval 30
- receiver sends delayed ACK for packets 1 and 2 with TS.Recent of 10
- sender receives ACK for packet 2 with tsecho of 10, lower than 30,
and thinks RTO was spurious

For better or worse, both of these bugs appear to be inherited from
the Eifel Detection Algorithm RFC (RFC 3522).

In any case, we've been working on coming up with a solution for these issues.

> --
> [PATCH 1/1] tcp: fix spurious undos in CA_Open
>
> There's a landmine in tcp_packet_delayed. In CA_Open the
> retrans_stamp is very eagerly cleared which falsely triggers
> packet-delayed detection. Therefore this code cannot be used
> in CA_Open if the retrans_stamp was already cleared.
>
> This became issue once DSACK detection was extended to happen
> in CA_Open state too (f698204bd0b, tcp: allow undo from
> reordered DSACKs). Essentially the whole congestion control
> is disabled because the undos now always restore the previous
> window. I wonder if this was already in production... ...at
> least the Internet didn't melt ;-).

I'm pretty sure this commit is unnecessary. It seems like a NOP. Note
that tcp_may_undo and tcp_packet_delayed are never called in state
TCP_CA_Open. tcp_may_undo is only called from tcp_try_undo_recovery,
tcp_try_undo_partial, and tcp_try_undo_loss, and to call any of these
functions you must be in either TCP_CA_Loss or TCP_CA_Recovery. Only
tcp_try_undo_dsack is called from TCP_CA_Open, but it makes no
reference to retrans_stamp.

Please let me know if I'm missing something, or if there's a specific
scenario you're concerned about.

thanks,
neal

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/1] tcp: fix spurious undos in CA_Open
  2011-12-12 19:40               ` Neal Cardwell
@ 2011-12-14  8:57                 ` Ilpo Järvinen
  2011-12-18 18:23                   ` Neal Cardwell
  0 siblings, 1 reply; 30+ messages in thread
From: Ilpo Järvinen @ 2011-12-14  8:57 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: David Miller, Netdev, nanditad, Yuchung Cheng, therbert

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5753 bytes --]

On Mon, 12 Dec 2011, Neal Cardwell wrote:

> On Mon, Dec 12, 2011 at 7:05 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Mon, 28 Nov 2011, Neal Cardwell wrote:
> >
> >> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
> >>
> >> >> Also, one (serious) word of caution! This change, by its extending of
> >> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
> >> >> managed to introduces subtle breaking to the state machine. Please check
> >> >> that the problem similar to what was fixed by
> >> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
> >> >> some other form causing spurious undos). ...To me it seems that
> >> >> tcp_packet_delayed might be similarly confused after the given patch.
> >> >
> >> > Neil, please look into this so we can have any such issues fixed
> >> > in time for the next merge window.
> >>
> >> Absolutely. I will look into the areas that Ilpo mentioned.
> >
> > Unfortunately nothing has happened? So I finally had to come up something
> > myself. ...Compile tested only.
> 
> Sorry for the radio silence, but I have been looking at this. So far
> it's been a time-consuming process, as we've uncovered a number of
> pre-existing issues in tcp_packet_delayed that seem like quite serious
> bugs:
> 
> 1) tcp_packet_delayed implicitly assumes, but does not verify, that
> the ACK in question (whose ECR field precedes retrans_stamp) covers
> all retransmitted segments. So if we retransmit segments 1 and 2, and
> then get an ACK for 1 that has an ECR indicating it was for the
> original copy of segment 1, then we undo,  even in cases where segment
> 2 really was lost and needed to be retransmitted.

I think this might be an mis-implementation in tcp_try_undo_partial.

> 2) tcp_packet_delayed implicitly assumes that if we retransmit a
> packet that was truly lost then any ACK for the retransmitted instance
> of that packet will echo the timestamp field in the retransmitted
> packet. But if the receiver is implementing delayed ACKs and correctly
> implements timestamps per RFC 1323, then the receiver will be obeying
> this clause: "Thus, when delayed ACKs are in use, the receiver should
> reply with the TSval field from the earliest unacknowledged segment."
> Thus tcp_packet_delayed can cause spurious undos in the following
> circumstances:
>
> - send segment 1 with tsval 10
> - receiver is doing delayed ACKs and does not ACK 1 but sets TS.Recent to 10
> - send segment 2 with tsval 20
> - segment 2 is lost in transit
> - RTO, retransmit segment 2 with tsval 30
> - receiver sends delayed ACK for packets 1 and 2 with TS.Recent of 10
> - sender receives ACK for packet 2 with tsecho of 10, lower than 30,
> and thinks RTO was spurious

About this I'm actually aware of, I have even some long email postponed 
related to this to discuss in context of 1323bis (as 1323bis seems pretty 
much standstill I'm not in a big hurry :-)). I think 1323 algo is flawed 
anyway and it was fixed/clarified/better in the bis (which doesn't 
address this well enough though).

I discovered this while I setup a network with ACK losses to figure out 
other issue. I discovered that the RTT measurements occassionally took 
"the wrong TS" bloating the RTT estimate. As RTT samples included RTO 
delays, with enough ACK losses the RTO eventually hits the max value. 
...As a result, not very nice performance was acquired for that such a 
flow.

I think there are two possible solutions:
  1) Echo latest timestamp for below window packets (I'm not sure if 
  there's can of worms w.r.t. security in this approach). A good thing
  in this would be that it would really solve the rexmit ambiguity problem 
  which the current approach does not do.
  2) Ignore packets with DSACK in RTT measurement.

1323bis is not of much help in this which TS to echo (which is why I 
intented to mention it on tcpm). IIRC, it depended on which takes 
precedence: RFC793 in-window check or the given TS algorithm, which is 
not discussed anywhere.

> For better or worse, both of these bugs appear to be inherited from
> the Eifel Detection Algorithm RFC (RFC 3522).

RFC3522 does not have the issue 1) as it's working after RTO+first ACK 
only. The second issue is valid for it though.

> In any case, we've been working on coming up with a solution for these issues.
> 
> > --
> > [PATCH 1/1] tcp: fix spurious undos in CA_Open
> >
> > There's a landmine in tcp_packet_delayed. In CA_Open the
> > retrans_stamp is very eagerly cleared which falsely triggers
> > packet-delayed detection. Therefore this code cannot be used
> > in CA_Open if the retrans_stamp was already cleared.
> >
> > This became issue once DSACK detection was extended to happen
> > in CA_Open state too (f698204bd0b, tcp: allow undo from
> > reordered DSACKs). Essentially the whole congestion control
> > is disabled because the undos now always restore the previous
> > window. I wonder if this was already in production... ...at
> > least the Internet didn't melt ;-).
> 
> I'm pretty sure this commit is unnecessary. It seems like a NOP. Note
> that tcp_may_undo and tcp_packet_delayed are never called in state
> TCP_CA_Open. tcp_may_undo is only called from tcp_try_undo_recovery,
> tcp_try_undo_partial, and tcp_try_undo_loss, and to call any of these
> functions you must be in either TCP_CA_Loss or TCP_CA_Recovery. Only
> tcp_try_undo_dsack is called from TCP_CA_Open, but it makes no
> reference to retrans_stamp.

It seems that I've missed this difference in tcp_try_undo_dsack (well, 
seen it but never thought it that much). I think you're right. Thanks for 
taking a look.

Some WARN_ON to enforce this !CA_Open condition there might not hurt 
though.

-- 
 i.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/1] tcp: fix spurious undos in CA_Open
  2011-12-14  8:57                 ` Ilpo Järvinen
@ 2011-12-18 18:23                   ` Neal Cardwell
  0 siblings, 0 replies; 30+ messages in thread
From: Neal Cardwell @ 2011-12-18 18:23 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: David Miller, Netdev, nanditad, Yuchung Cheng, therbert

On Wed, Dec 14, 2011 at 3:57 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Mon, 12 Dec 2011, Neal Cardwell wrote:
>
>> On Mon, Dec 12, 2011 at 7:05 AM, Ilpo Järvinen
>> <ilpo.jarvinen@helsinki.fi> wrote:
>> > On Mon, 28 Nov 2011, Neal Cardwell wrote:
>> >
>> >> On Sun, Nov 27, 2011 at 6:58 PM, David Miller <davem@davemloft.net> wrote:
>> >>
>> >> >> Also, one (serious) word of caution! This change, by its extending of
>> >> >> CA_Open state, is somewhat similar to what I made FRTO years ago, and
>> >> >> managed to introduces subtle breaking to the state machine. Please check
>> >> >> that the problem similar to what was fixed by
>> >> >> a1c1f281b84a751fdb5ff919da3b09df7297619f does not reappear (possibly in
>> >> >> some other form causing spurious undos). ...To me it seems that
>> >> >> tcp_packet_delayed might be similarly confused after the given patch.
>> >> >
>> >> > Neil, please look into this so we can have any such issues fixed
>> >> > in time for the next merge window.
>> >>
>> >> Absolutely. I will look into the areas that Ilpo mentioned.
>> >
>> > Unfortunately nothing has happened? So I finally had to come up something
>> > myself. ...Compile tested only.
>>
>> Sorry for the radio silence, but I have been looking at this. So far
>> it's been a time-consuming process, as we've uncovered a number of
>> pre-existing issues in tcp_packet_delayed that seem like quite serious
>> bugs:
>>
>> 1) tcp_packet_delayed implicitly assumes, but does not verify, that
>> the ACK in question (whose ECR field precedes retrans_stamp) covers
>> all retransmitted segments. So if we retransmit segments 1 and 2, and
>> then get an ACK for 1 that has an ECR indicating it was for the
>> original copy of segment 1, then we undo,  even in cases where segment
>> 2 really was lost and needed to be retransmitted.
>
> I think this might be an mis-implementation in tcp_try_undo_partial.

Yeah, I'm pretty sure this is a bug. There needs to be some additional
logic, whether in tcp_try_undo_partial or tcp_packet_delayed, or
somewhere else, to fix this bug.


>> 2) tcp_packet_delayed implicitly assumes that if we retransmit a
>> packet that was truly lost then any ACK for the retransmitted instance
>> of that packet will echo the timestamp field in the retransmitted
>> packet. But if the receiver is implementing delayed ACKs and correctly
>> implements timestamps per RFC 1323, then the receiver will be obeying
>> this clause: "Thus, when delayed ACKs are in use, the receiver should
>> reply with the TSval field from the earliest unacknowledged segment."
>> Thus tcp_packet_delayed can cause spurious undos in the following
>> circumstances:
>>
>> - send segment 1 with tsval 10
>> - receiver is doing delayed ACKs and does not ACK 1 but sets TS.Recent to 10
>> - send segment 2 with tsval 20
>> - segment 2 is lost in transit
>> - RTO, retransmit segment 2 with tsval 30
>> - receiver sends delayed ACK for packets 1 and 2 with TS.Recent of 10
>> - sender receives ACK for packet 2 with tsecho of 10, lower than 30,
>> and thinks RTO was spurious
>
> About this I'm actually aware of, I have even some long email postponed
> related to this to discuss in context of 1323bis (as 1323bis seems pretty
> much standstill I'm not in a big hurry :-)). I think 1323 algo is flawed
> anyway and it was fixed/clarified/better in the bis (which doesn't
> address this well enough though).
>
> I discovered this while I setup a network with ACK losses to figure out
> other issue. I discovered that the RTT measurements occassionally took
> "the wrong TS" bloating the RTT estimate. As RTT samples included RTO
> delays, with enough ACK losses the RTO eventually hits the max value.
> ...As a result, not very nice performance was acquired for that such a
> flow.
>
> I think there are two possible solutions:
>  1) Echo latest timestamp for below window packets (I'm not sure if
>  there's can of worms w.r.t. security in this approach). A good thing
>  in this would be that it would really solve the rexmit ambiguity problem
>  which the current approach does not do.
>  2) Ignore packets with DSACK in RTT measurement.
>
> 1323bis is not of much help in this which TS to echo (which is why I
> intented to mention it on tcpm). IIRC, it depended on which takes
> precedence: RFC793 in-window check or the given TS algorithm, which is
> not discussed anywhere.

Interesting. I hope the standards get improved in this area.


>> For better or worse, both of these bugs appear to be inherited from
>> the Eifel Detection Algorithm RFC (RFC 3522).
>
> RFC3522 does not have the issue 1) as it's working after RTO+first ACK
> only. The second issue is valid for it though.

Actually RFC 3522 does have issue 1) because it is also in effect for
fast retransmit / fast recovery; RFC 3522 says "If the Eifel detection
algorithm is used, the following steps MUST be taken by a TCP sender,
but only upon initiation of loss recovery, i.e., when either the
timeout-based retransmit or the fast retransmit is sent."

In addition, RFC 3517 allows senders to send out multiple
retransmissions upon the first ACK that initiates fast recovery:
Section 5., Algorithm Details, says "(5) In order to take advantage of
potential additional available cwnd, proceed to step (C) below. ...
(C) If cwnd - pipe >= 1 SMSS the sender SHOULD transmit one or more
segments." As a consequence, senders can have multiple retransmissions
in flight at the point they get the first ACK during fast recovery, at
which point they run the Eifel detection algorithm.

So I'm pretty sure 1) is an issue both in the RFCs and in Linux.


>> In any case, we've been working on coming up with a solution for these issues.
>>
>> > --
>> > [PATCH 1/1] tcp: fix spurious undos in CA_Open
>> >
>> > There's a landmine in tcp_packet_delayed. In CA_Open the
>> > retrans_stamp is very eagerly cleared which falsely triggers
>> > packet-delayed detection. Therefore this code cannot be used
>> > in CA_Open if the retrans_stamp was already cleared.
>> >
>> > This became issue once DSACK detection was extended to happen
>> > in CA_Open state too (f698204bd0b, tcp: allow undo from
>> > reordered DSACKs). Essentially the whole congestion control
>> > is disabled because the undos now always restore the previous
>> > window. I wonder if this was already in production... ...at
>> > least the Internet didn't melt ;-).
>>
>> I'm pretty sure this commit is unnecessary. It seems like a NOP. Note
>> that tcp_may_undo and tcp_packet_delayed are never called in state
>> TCP_CA_Open. tcp_may_undo is only called from tcp_try_undo_recovery,
>> tcp_try_undo_partial, and tcp_try_undo_loss, and to call any of these
>> functions you must be in either TCP_CA_Loss or TCP_CA_Recovery. Only
>> tcp_try_undo_dsack is called from TCP_CA_Open, but it makes no
>> reference to retrans_stamp.
>
> It seems that I've missed this difference in tcp_try_undo_dsack (well,
> seen it but never thought it that much). I think you're right. Thanks for
> taking a look.
>
> Some WARN_ON to enforce this !CA_Open condition there might not hurt
> though.

Sounds reasonable. I'll think about this as I work on coming up with a
fix for the 1) and 2) bugs discussed above.

neal

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2011-12-18 18:23 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-16 18:58 [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Neal Cardwell
2011-11-16 18:58 ` [PATCH 2/5] tcp: use DSACKs that arrive when packets_out is 0 Neal Cardwell
2011-11-17  1:34   ` Ilpo Järvinen
2011-11-27 23:55     ` David Miller
2011-11-16 18:58 ` [PATCH 3/5] tcp: use SACKs and DSACKs that arrive on ACKs below snd_una Neal Cardwell
2011-11-17  1:52   ` Ilpo Järvinen
2011-11-27 23:57     ` David Miller
2011-11-16 18:58 ` [PATCH 4/5] tcp: allow undo from reordered DSACKs Neal Cardwell
2011-11-17  5:18   ` Ilpo Järvinen
2011-11-17 20:49     ` Neal Cardwell
2011-11-22 12:44       ` Ilpo Järvinen
2011-11-27 23:58         ` David Miller
2011-11-28 17:31           ` Neal Cardwell
2011-12-12 12:05             ` [PATCH 1/1] tcp: fix spurious undos in CA_Open Ilpo Järvinen
2011-12-12 19:40               ` Neal Cardwell
2011-12-14  8:57                 ` Ilpo Järvinen
2011-12-18 18:23                   ` Neal Cardwell
2011-11-16 18:58 ` [PATCH 5/5] tcp: skip cwnd moderation in TCP_CA_Open in tcp_try_to_open Neal Cardwell
2011-11-17  5:14   ` Ilpo Järvinen
2011-11-19 21:08     ` Neal Cardwell
2011-11-20  8:38       ` Ilpo Järvinen
2011-11-21 22:57         ` Yuchung Cheng
2011-11-22 11:47           ` Ilpo Järvinen
2011-11-28  0:01             ` David Miller
2011-11-28 17:35               ` Neal Cardwell
2011-11-28 20:33                 ` Yuchung Cheng
2011-11-29 10:33                   ` Ilpo Järvinen
2011-11-20 10:19       ` Alexander Zimmermann
2011-11-17  1:23 ` [PATCH 1/5] tcp: make is_dupack a parameter to tcp_fastretrans_alert() Ilpo Järvinen
2011-11-27 23:55   ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).