netdev.vger.kernel.org archive mirror
* [RFC PATCHv5 0/5] LOST marking rewrite
@ 2007-03-16 18:41 Ilpo Järvinen
  2007-03-16 18:41 ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK Ilpo Järvinen
  0 siblings, 1 reply; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-16 18:41 UTC (permalink / raw)
  To: netdev; +Cc: David Miller

It's strange how I always manage to find a couple of problems quite soon
after I've sent things to the public... :-) Once again, two important
fixes to v4.

This patchset provides a new LOST detection algorithm. It is part of
the tcp_sock diet program. It removes a couple of hints; sadly enough,
this is a tradeoff, as one u32 is necessary for the new algorithm even
though most of the time the last SACK would give all the necessary
information.

It might be useful to add unlikely() to the rexmit hint clearing, since
it really shouldn't happen except very, very rarely, if ever. And maybe
also to the !tp->sacked_out check.

I think I have introduced a small problem to the timedout thingie;
however, we can work around it by faking highest_sack to a higher value
if the timedout loop advances too far, to avoid (potentially expensive)
walking... It would then be highest_sacked_or_lost (or highest_marked,
whatever).

Testing of v5 is underway (I'll have full results after a couple of
hours, but the part I checked already was correct; more trustworthy
testing this time, as CBI was not interfering with my test set).


    v4-to-v5 changes:
     - Fixed entry_seq when the last (and updated) SACK block is less than
       tp->reordering * MSS (previously this worked only by chance and
       CBI hid the troubles)
     - Delayed the holes_seen pre-increment to avoid exiting the marker
       too early
     - Silenced a compiler warning the RFC3517 fast_rexmit thingie
       introduces

    v3-to-v4 changes:
     - Fixed FACK off-by-one error that was introduced in v3 change
     - Provided RFC3517 SACK as a separate patch
     - Added "a fastpath" optimization
     - Restructured the algorithm to get it cleaner
     - Decided that there isn't a clear way to handle the R-without-L
       frames timedout thingie above the reord marking (the old code is
       quite impossible to follow anyway, as it does not have any reliable
       invariants because the timestamps change when the -> R transition
       occurs). Besides, such a case is probably too marginal to do
       something expensive for. Depending on the hint state, the results
       could be vastly different in the old algo. Thus I went for a
       straightforward way: if we have a timed-out skb, the remaining ones
       will also time out (this could also be seen as an ok definition,
       since R without L should be made R|L at some point!). It could be
       possible to have some invariants for this if there were a cheap way
       to access the skb right after the last rexmit skb (I have
       intentionally avoided hint usage since we're hopefully going to bid
       farewell to them).
     - Handle !sacked_out case correctly
     - Guarantee highest_sack in sync only when sacked_out > 0
     - Rebased to a recent net-2.6.22 (a couple of space changes hit the
       removed area & the RB-tree fixes were done)
    
    v2-to-v3 changes:
     - The replacement of skipping made the worries in it obsolete, and
       now really one skb less is traversed (as promised in v2)
    
    v1-to-v2 changes:
     - Changed non-FACK SACK to something less broken; not a complete
       set of changes (other functions should also be fixed)
     - Fixed NewReno bugs
     - More comments & added RFCs to places I'm unsure of
     - Delayed the reord_count increment (now traverses one skb less)
     - Copied the retransmit_hint clearing from the original, but I
       think it's superfluous
     - Separated highest_sack into its own patch
     - Fixed off-by-one bug in skipping




^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK
  2007-03-16 18:41 [RFC PATCHv5 0/5] LOST marking rewrite Ilpo Järvinen
@ 2007-03-16 18:41 ` Ilpo Järvinen
  2007-03-16 18:41   ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code Ilpo Järvinen
  2007-03-25  4:04   ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK David Miller
  0 siblings, 2 replies; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-16 18:41 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Ilpo Järvinen

It is guaranteed to be valid only when tp->sacked_out is non-zero.
In most cases this seqno is available in the last ACK, but there is
no guarantee of that. The new fast recovery loss marking algorithm
needs this as an entry point.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/tcp.h  |    2 ++
 net/ipv4/tcp_input.c |    8 +++++++-
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 09f2d66..9e542d5 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,8 @@ #endif
 
 	struct tcp_sack_block_wire recv_sack_cache[4];
 
+	u32	highest_sack;	/* Start seq of globally highest recvd SACK (valid only in slowpath) */
+
 	/* from STCP, retrans queue hinting */
 	struct sk_buff* lost_skb_hint;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c3b4c7..dbc98ce 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -954,8 +954,10 @@ tcp_sacktag_write_queue(struct sock *sk,
 	int i;
 	int first_sack_index;
 
-	if (!tp->sacked_out)
+	if (!tp->sacked_out) {
 		tp->fackets_out = 0;
+		tp->highest_sack = tp->snd_una;
+	}
 	prior_fackets = tp->fackets_out;
 
 	/* Check for D-SACK. */
@@ -1189,6 +1191,10 @@ tcp_sacktag_write_queue(struct sock *sk,
 
 				if (fack_count > tp->fackets_out)
 					tp->fackets_out = fack_count;
+
+				if (after(TCP_SKB_CB(skb)->seq,
+				    tp->highest_sack))
+					tp->highest_sack = TCP_SKB_CB(skb)->seq;
 			} else {
 				if (dup_sack && (sacked&TCPCB_RETRANS))
 					reord = min(fack_count, reord);
-- 
1.4.2



* [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code
  2007-03-16 18:41 ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK Ilpo Järvinen
@ 2007-03-16 18:41   ` Ilpo Järvinen
  2007-03-16 18:41     ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions Ilpo Järvinen
  2007-03-25  4:05     ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code David Miller
  2007-03-25  4:04   ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK David Miller
  1 sibling, 2 replies; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-16 18:41 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Ilpo Järvinen

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   31 ++++++++++++++++---------------
 1 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index dbc98ce..870404b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1733,6 +1733,20 @@ static inline void tcp_reset_reno_sack(s
 	tp->left_out = tp->lost_out;
 }
 
+/* RFC: This is from the original, I doubt that this is necessary at all:
+ * clear xmit_retrans hint if seq of this skb is beyond hint. How could we have
+ * retransmitted past LOST markings in the first place? I'm not fully sure
+ * about undo and end of connection cases, which can cause R without L?
+ */
+static void tcp_verify_retransmit_hint(struct tcp_sock *tp,
+				       struct sk_buff *skb)
+{
+	if ((tp->retransmit_skb_hint != NULL) &&
+	    before(TCP_SKB_CB(skb)->seq,
+	    TCP_SKB_CB(tp->retransmit_skb_hint)->seq))
+		tp->retransmit_skb_hint = skb;
+}
+
 /* Mark head of queue up as lost. */
 static void tcp_mark_head_lost(struct sock *sk, struct tcp_sock *tp,
 			       int packets, u32 high_seq)
@@ -1762,14 +1776,7 @@ static void tcp_mark_head_lost(struct so
 		if (!(TCP_SKB_CB(skb)->sacked&TCPCB_TAGBITS)) {
 			TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
 			tp->lost_out += tcp_skb_pcount(skb);
-
-			/* clear xmit_retransmit_queue hints
-			 *  if this is beyond hint */
-			if (tp->retransmit_skb_hint != NULL &&
-			    before(TCP_SKB_CB(skb)->seq,
-				   TCP_SKB_CB(tp->retransmit_skb_hint)->seq))
-				tp->retransmit_skb_hint = NULL;
-
+			tcp_verify_retransmit_hint(tp, skb);
 		}
 	}
 	tcp_sync_left_out(tp);
@@ -1808,13 +1815,7 @@ static void tcp_update_scoreboard(struct
 			if (!(TCP_SKB_CB(skb)->sacked&TCPCB_TAGBITS)) {
 				TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
 				tp->lost_out += tcp_skb_pcount(skb);
-
-				/* clear xmit_retrans hint */
-				if (tp->retransmit_skb_hint &&
-				    before(TCP_SKB_CB(skb)->seq,
-					   TCP_SKB_CB(tp->retransmit_skb_hint)->seq))
-
-					tp->retransmit_skb_hint = NULL;
+				tcp_verify_retransmit_hint(tp, skb);
 			}
 		}
 
-- 
1.4.2



* [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions
  2007-03-16 18:41   ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code Ilpo Järvinen
@ 2007-03-16 18:41     ` Ilpo Järvinen
  2007-03-16 18:41       ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations Ilpo Järvinen
  2007-03-25  4:09       ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions David Miller
  2007-03-25  4:05     ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code David Miller
  1 sibling, 2 replies; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-16 18:41 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Ilpo Järvinen

Complete rewrite of update_scoreboard and mark_head_lost. A couple
of hints became unnecessary because of this change, and the NewReno
code was greatly simplified. I changed the !TCPCB_TAGBITS check from
the original to a !(S|L) check, but it shouldn't make a difference,
and if there ever is an R-only skb, TCP will mark it as LOST too,
which is correct behavior (IMHO). The algorithm uses some ideas
presented by David Miller and Baruch Even.

This patch provides an unoptimized version.

Seqno lookups require fast access; this is provided by the RB-tree
patch (which is nowadays included in the latest net-2.6.22).

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/tcp.h  |    4 -
 include/net/tcp.h    |   11 +++
 net/ipv4/tcp_input.c |  180 ++++++++++++++++++++++++++++++++++++--------------
 3 files changed, 139 insertions(+), 56 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 9e542d5..728321f 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -338,15 +338,11 @@ #endif
 	u32	highest_sack;	/* Start seq of globally highest recvd SACK (valid only in slowpath) */
 
 	/* from STCP, retrans queue hinting */
-	struct sk_buff* lost_skb_hint;
-
-	struct sk_buff *scoreboard_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 	struct sk_buff *forward_skb_hint;
 	struct sk_buff *fastpath_skb_hint;
 
 	int     fastpath_cnt_hint;
-	int     lost_cnt_hint;
 	int     retransmit_cnt_hint;
 	int     forward_cnt_hint;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index b451127..fe1c4f0 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1047,8 +1047,6 @@ static inline void tcp_mib_init(void)
 
 /*from STCP */
 static inline void clear_all_retrans_hints(struct tcp_sock *tp){
-	tp->lost_skb_hint = NULL;
-	tp->scoreboard_skb_hint = NULL;
 	tp->retransmit_skb_hint = NULL;
 	tp->forward_skb_hint = NULL;
 	tp->fastpath_skb_hint = NULL;
@@ -1194,6 +1192,11 @@ static inline struct sk_buff *tcp_write_
 	return skb->next;
 }
 
+static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_buff *skb)
+{
+	return skb->prev;
+}
+
 #define tcp_for_write_queue(skb, sk)					\
 		for (skb = (sk)->sk_write_queue.next;			\
 		     (skb != (struct sk_buff *)&(sk)->sk_write_queue);	\
@@ -1203,6 +1206,10 @@ #define tcp_for_write_queue_from(skb, sk
 		for (; (skb != (struct sk_buff *)&(sk)->sk_write_queue);\
 		     skb = skb->next)
 
+#define tcp_for_write_queue_backwards_from(skb, sk)			\
+		for (; (skb != (struct sk_buff *)&(sk)->sk_write_queue);\
+		     skb = skb->prev)
+
 static inline struct sk_buff *tcp_send_head(struct sock *sk)
 {
 	return sk->sk_send_head;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 870404b..f2b3f68 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1737,92 +1737,172 @@ static inline void tcp_reset_reno_sack(s
  * clear xmit_retrans hint if seq of this skb is beyond hint. How could we have
  * retransmitted past LOST markings in the first place? I'm not fully sure
  * about undo and end of connection cases, which can cause R without L?
+ * Anyway, if it was already R (it should be), no need to touch the hint.
  */
 static void tcp_verify_retransmit_hint(struct tcp_sock *tp,
 				       struct sk_buff *skb)
 {
 	if ((tp->retransmit_skb_hint != NULL) &&
+	    !(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_RETRANS) &&
 	    before(TCP_SKB_CB(skb)->seq,
 	    TCP_SKB_CB(tp->retransmit_skb_hint)->seq))
 		tp->retransmit_skb_hint = skb;
 }
 
-/* Mark head of queue up as lost. */
-static void tcp_mark_head_lost(struct sock *sk, struct tcp_sock *tp,
-			       int packets, u32 high_seq)
+/* Walk forward starting from skb until a non-timedout skb is encountered,
+ * timing out everything on the way
+ */
+static void tcp_timedout_mark_forward(struct sock *sk, struct sk_buff *skb)
 {
-	struct sk_buff *skb;
-	int cnt;
-
-	BUG_TRAP(packets <= tp->packets_out);
-	if (tp->lost_skb_hint) {
-		skb = tp->lost_skb_hint;
-		cnt = tp->lost_cnt_hint;
-	} else {
-		skb = tcp_write_queue_head(sk);
-		cnt = 0;
-	}
+	struct tcp_sock *tp = tcp_sk(sk);
 
 	tcp_for_write_queue_from(skb, sk) {
-		if (skb == tcp_send_head(sk))
+		if (skb == tcp_send_head(sk) || !tcp_skb_timedout(sk, skb))
 			break;
-		/* TODO: do this better */
-		/* this is not the most efficient way to do this... */
-		tp->lost_skb_hint = skb;
-		tp->lost_cnt_hint = cnt;
-		cnt += tcp_skb_pcount(skb);
-		if (cnt > packets || after(TCP_SKB_CB(skb)->end_seq, high_seq))
-			break;
-		if (!(TCP_SKB_CB(skb)->sacked&TCPCB_TAGBITS)) {
+		/* Could be lost already from a previous timedout check */
+		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_LOST)) {
 			TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
 			tp->lost_out += tcp_skb_pcount(skb);
 			tcp_verify_retransmit_hint(tp, skb);
 		}
 	}
-	tcp_sync_left_out(tp);
 }
 
-/* Account newly detected lost packet(s) */
+/* Simple NewReno thing: Mark head LOST if it wasn't yet and it's below
+ * high_seq, stop. That's all.
+ */
+static void tcp_mark_head_lost_single(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb = tcp_write_queue_head(sk);
 
-static void tcp_update_scoreboard(struct sock *sk, struct tcp_sock *tp)
+	if (!(TCP_SKB_CB(skb)->sacked & TCPCB_LOST) &&
+	    before(tp->snd_una, tp->high_seq)) {
+		TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
+		tp->lost_out += tcp_skb_pcount(skb);
+		tcp_sync_left_out(tp);
+	}
+}
+
+/* Marks per-skb LOST bits when SACK is enabled.
+ *
+ * Algorithm details
+ *
+ * Walk backwards until the first previously LOST marked skb is found (or head
+ * is encountered) starting from highest_sack seq. Phases:
+ *
+ * Ia)  Condition 1, enough SACK blocks to indicate a loss:
+ *       Skip skbs until reord_count >= tp->reordering, then continue from
+ *       min(skb_now->seq, tp->high_seq) and goto Marker.
+ *
+ * Ib)  Condition 2, skbs timed out:
+ *       During walk, also check if the currently processed skb has timed out.
+ *       If such skb is found, goto Marker (thus if it is found before skipping
+ *       to tp->high_seq in Ia, no skipping will be done).
+ *
+ * II)  Marker
+ *       Mark every non-SACKed skb that is encountered as LOST. When first
+ *       previously LOST marked skb is found, short-circuit.
+ *
+ * III) Polish up
+ *       If no skbs are marked, at least one is marked after the loop.
+ *
+ * IV)  Timedout work to do (if top timedout skb was not encountered)
+ *       If the previous walking didn't reveal the edge after which skbs are
+ *       not timedout, it might be past the work we've done so far. Therefore
+ *       call forward walk that marks timedout skbs as LOST starting from an
+ *       entry point right above a known point provided from the backwards
+ *       walk. Basically the entry point will be next skb after highest_sack
+ *       or high_seq (if TCP did the skip).
+ */
+static void tcp_update_scoreboard_fack(struct sock *sk)
 {
-	if (IsFack(tp)) {
-		int lost = tp->fackets_out - tp->reordering;
-		if (lost <= 0)
-			lost = 1;
-		tcp_mark_head_lost(sk, tp, lost, tp->high_seq);
+	struct tcp_sock *tp = tcp_sk(sk);
+	int timedout_continue = 1;
+	/* Beware: not_marked_skb might be pointing to sk_write_queue */
+	struct sk_buff *not_marked_skb;
+	struct sk_buff *skb;
+
+	if (!tp->sacked_out) {
+		tcp_mark_head_lost_single(sk);
+		not_marked_skb = tcp_write_queue_head(sk);
+		not_marked_skb = tcp_write_queue_next(sk, not_marked_skb);
+
 	} else {
-		tcp_mark_head_lost(sk, tp, 1, tp->high_seq);
-	}
+		unsigned int reord_count = 0;
 
-	/* New heuristics: it is possible only after we switched
-	 * to restart timer each time when something is ACKed.
-	 * Hence, we can detect timed out packets during fast
-	 * retransmit without falling to slow start.
-	 */
-	if (!IsReno(tp) && tcp_head_timedout(sk, tp)) {
-		struct sk_buff *skb;
+		skb = tcp_write_queue_find(sk, tp->highest_sack);
+		/* If this ever becomes expensive, it can be delayed */
+		not_marked_skb = tcp_write_queue_next(sk, skb);
 
-		skb = tp->scoreboard_skb_hint ? tp->scoreboard_skb_hint
-			: tcp_write_queue_head(sk);
+		/* Phase I: Search until TCP can mark */
+		tcp_for_write_queue_backwards_from(skb, sk) {
+			if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
+				break;
 
-		tcp_for_write_queue_from(skb, sk) {
-			if (skb == tcp_send_head(sk))
+			if (tcp_skb_timedout(sk, skb))
 				break;
-			if (!tcp_skb_timedout(sk, skb))
+			else
+				timedout_continue = 0;
+
+			/*
+			 * Isn't marked, thus a possible entrypoint (last skb
+			 * before LOST edge but TCP doesn't know for sure yet)
+			 */
+			if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
+				not_marked_skb = skb;
+
+			reord_count += tcp_skb_pcount(skb);
+			if (reord_count > tp->reordering) {
+				if (after(TCP_SKB_CB(skb)->seq, tp->high_seq)) {
+					/* RFC: should we have find_below? */
+					skb = tcp_write_queue_find(sk, tp->high_seq);
+					not_marked_skb = skb;
+					skb = tcp_write_queue_prev(sk, skb);
+					/* Timedout top is again uncertain? */
+					if (tcp_skb_timedout(sk, skb))
+						timedout_continue = 1;
+				}
+				/* ...else:
+				 * RFC: Might have to handle skb fragmentation
+				 * here if reord_count > tp->reordering, which
+				 * can be caused by pcount > 1 in the last
+				 * skb... Original does not handle it, btw.
+				 */
 				break;
+			}
+		}
 
-			if (!(TCP_SKB_CB(skb)->sacked&TCPCB_TAGBITS)) {
+		/* Phase II: Marker */
+		tcp_for_write_queue_backwards_from(skb, sk) {
+			if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
+				break;
+			if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {
 				TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
 				tp->lost_out += tcp_skb_pcount(skb);
 				tcp_verify_retransmit_hint(tp, skb);
 			}
 		}
 
-		tp->scoreboard_skb_hint = skb;
-
-		tcp_sync_left_out(tp);
+		/* Phase III: Nothing marked yet? Then mark head */
+		if (!tp->lost_out)
+			tcp_mark_head_lost_single(sk);
 	}
+
+	/* Continue with timedout work */
+	if (timedout_continue)
+		tcp_timedout_mark_forward(sk, not_marked_skb);
+
+	tcp_sync_left_out(tp);
+}
+
+/* Account newly detected lost packet(s) */
+static void tcp_update_scoreboard(struct sock *sk, struct tcp_sock *tp)
+{
+	if (!IsReno(tp))
+		tcp_update_scoreboard_fack(sk);
+	else
+		tcp_mark_head_lost_single(sk);
 }
 
 /* CWND moderation, preventing bursts due to too big ACKs
@@ -2118,7 +2198,7 @@ tcp_fastretrans_alert(struct sock *sk, u
 	    before(tp->snd_una, tp->high_seq) &&
 	    icsk->icsk_ca_state != TCP_CA_Open &&
 	    tp->fackets_out > tp->reordering) {
-		tcp_mark_head_lost(sk, tp, tp->fackets_out-tp->reordering, tp->high_seq);
+	    	tcp_update_scoreboard_fack(sk);
 		NET_INC_STATS_BH(LINUX_MIB_TCPLOSS);
 	}
 
-- 
1.4.2



* [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations
  2007-03-16 18:41     ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions Ilpo Järvinen
@ 2007-03-16 18:41       ` Ilpo Järvinen
  2007-03-16 18:41         ` [RFC PATCHv5 5/5] [TCP]: non-FACK SACK follows conservative SACK loss recovery (RFC3517) Ilpo Järvinen
  2007-03-25  4:10         ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations David Miller
  2007-03-25  4:09       ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions David Miller
  1 sibling, 2 replies; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-16 18:41 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Ilpo Järvinen

1) A couple of skb states are mutually exclusive

An skb cannot be in both the S and L states at the same time. Adding
the count of skbs (below the highest SACK) that are neither S nor L
to that sum lets it be compared against fackets_out to see whether
anything below the current skb can still be marked with L. If they're
equal (or fackets_out is smaller), the next skb is sure to have L set.

2) Create a fastpath for the new LOST marker

The fastpath takes advantage of the fact that the latest ACK very
likely contains the globally highest SACK block. By verifying that
it is larger than tp->reordering, the whole SACK block can be
skipped while counting the not-to-be-marked skbs, until
tp->reordering of them is encountered. Since TCP now knows that
such a block exists as the highest, the marking can begin right
below it. If the latest ACK does not contain a SACK block that
reaches all the way to highest_sack (e.g., due to reordering), a
slow path is used.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |  139 +++++++++++++++++++++++++++++++++-----------------
 1 files changed, 92 insertions(+), 47 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f2b3f68..d34636b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -936,7 +936,8 @@ #endif
  * account for retransmits accurately.
  */
 static int
-tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_una)
+tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb,
+			u32 prior_snd_una, u32 *mark_lost_entry_seq)
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -957,7 +958,8 @@ tcp_sacktag_write_queue(struct sock *sk,
 	if (!tp->sacked_out) {
 		tp->fackets_out = 0;
 		tp->highest_sack = tp->snd_una;
-	}
+	} else
+		*mark_lost_entry_seq = tp->highest_sack;
 	prior_fackets = tp->fackets_out;
 
 	/* Check for D-SACK. */
@@ -1212,6 +1214,19 @@ tcp_sacktag_write_queue(struct sock *sk,
 				tp->retransmit_skb_hint = NULL;
 			}
 		}
+
+		/* Prepare non-reno LOST marking fast path entry point, the
+		 * last ACK must have the globally highest SACK to use
+		 * fastpath, if the highest SACK block is larger than
+		 * tp->reordering, just skip collecting reord_count from
+		 * it when marking LOSTs later.
+		 */
+		if (!before(end_seq, tp->highest_sack)) {
+			if ((end_seq - start_seq) >= tp->reordering * tp->mss_cache)
+				*mark_lost_entry_seq = start_seq;
+			else
+				*mark_lost_entry_seq = tp->highest_sack;
+		}
 	}
 
 	/* Check for lost retransmit. This superb idea is
@@ -1815,7 +1830,7 @@ static void tcp_mark_head_lost_single(st
  *       walk. Basically the entry point will be next skb after highest_sack
  *       or high_seq (if TCP did the skip).
  */
-static void tcp_update_scoreboard_fack(struct sock *sk)
+static void tcp_update_scoreboard_fack(struct sock *sk, u32 entry_seq)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	int timedout_continue = 1;
@@ -1829,54 +1844,72 @@ static void tcp_update_scoreboard_fack(s
 		not_marked_skb = tcp_write_queue_next(sk, not_marked_skb);
 
 	} else {
-		unsigned int reord_count = 0;
+		unsigned int holes_seen = 0;
+		int reentry_to_highest_sack = 0;
 
-		skb = tcp_write_queue_find(sk, tp->highest_sack);
+		skb = tcp_write_queue_find(sk, entry_seq);
 		/* If this ever becomes expensive, it can be delayed */
 		not_marked_skb = tcp_write_queue_next(sk, skb);
+		if (entry_seq != tp->highest_sack) {
+			/* Not interested in "the last" SACKed one we got */
+			/* RFC: find_below could help here too */
+			skb = tcp_write_queue_prev(sk, skb);
+			/* Delay lookup because it might turn out unnecessary! */
+			reentry_to_highest_sack = 1;
+		} else {
+			unsigned int reord_count = 0;
 
-		/* Phase I: Search until TCP can mark */
-		tcp_for_write_queue_backwards_from(skb, sk) {
-			if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
-				break;
-
-			if (tcp_skb_timedout(sk, skb))
-				break;
-			else
-				timedout_continue = 0;
+			/* Phase I: Search until TCP can mark */
+			tcp_for_write_queue_backwards_from(skb, sk) {
+				if ((tp->fackets_out <= tp->sacked_out +
+							tp->lost_out +
+							holes_seen) ||
+				    (TCP_SKB_CB(skb)->sacked & TCPCB_LOST))
+					goto backwards_walk_done;
 
-			/*
-			 * Isn't marked, thus a possible entrypoint (last skb
-			 * before LOST edge but TCP doesn't know for sure yet)
-			 */
-			if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
-				not_marked_skb = skb;
-
-			reord_count += tcp_skb_pcount(skb);
-			if (reord_count > tp->reordering) {
-				if (after(TCP_SKB_CB(skb)->seq, tp->high_seq)) {
-					/* RFC: should we have find_below? */
-					skb = tcp_write_queue_find(sk, tp->high_seq);
-					not_marked_skb = skb;
-					skb = tcp_write_queue_prev(sk, skb);
-					/* Timedout top is again uncertain? */
-					if (tcp_skb_timedout(sk, skb))
-						timedout_continue = 1;
+				if (tcp_skb_timedout(sk, skb))
+					break;
+				else {
+					timedout_continue = 0;
+					reentry_to_highest_sack = 0;
 				}
-				/* ...else:
-				 * RFC: Might have to handle skb fragmentation
-				 * here if reord_count > tp->reordering, which
-				 * can be caused by pcount > 1 in the last
-				 * skb... Original does not handle it, btw.
+
+				/*
+				 * Isn't marked, thus a possible entrypoint
+				 * (last skb before LOST edge but TCP doesn't
+				 * know for sure yet)
 				 */
-				break;
+				if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
+					not_marked_skb = skb;
+
+				reord_count += tcp_skb_pcount(skb);
+				if (reord_count > tp->reordering)
+					break;
+
+				if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
+					holes_seen += tcp_skb_pcount(skb);
 			}
 		}
 
+		if (!tcp_skb_timedout(sk, skb) &&
+		    after(TCP_SKB_CB(skb)->seq, tp->high_seq)) {
+			/* RFC: should we have find_below? */
+			skb = tcp_write_queue_find(sk, tp->high_seq);
+			not_marked_skb = skb;
+			skb = tcp_write_queue_prev(sk, skb);
+			/* Timedout top is again uncertain? */
+			if (tcp_skb_timedout(sk, skb))
+				timedout_continue = 1;
+		}
+		/* RFC: ...else if (!tcp_skb_timedout) do skb fragmentation? */
+
 		/* Phase II: Marker */
 		tcp_for_write_queue_backwards_from(skb, sk) {
-			if (TCP_SKB_CB(skb)->sacked & TCPCB_LOST)
-				break;
+			if ((tp->fackets_out <= tp->sacked_out + tp->lost_out +
+						holes_seen) ||
+			    (TCP_SKB_CB(skb)->sacked & TCPCB_LOST))
+				goto backwards_walk_done;
+
 			if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) {
 				TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
 				tp->lost_out += tcp_skb_pcount(skb);
@@ -1887,6 +1920,13 @@ static void tcp_update_scoreboard_fack(s
 		/* Phase III: Nothing marked yet? Then mark head */
 		if (!tp->lost_out)
 			tcp_mark_head_lost_single(sk);
+
+backwards_walk_done:
+		if (timedout_continue && reentry_to_highest_sack) {
+			/* ...do the delayed lookup */
+			skb = tcp_write_queue_find(sk, tp->highest_sack);
+			not_marked_skb = tcp_write_queue_next(sk, skb);
+		}
 	}
 
 	/* Continue with timedout work */
@@ -1897,10 +1937,11 @@ static void tcp_update_scoreboard_fack(s
 }
 
 /* Account newly detected lost packet(s) */
-static void tcp_update_scoreboard(struct sock *sk, struct tcp_sock *tp)
+static void tcp_update_scoreboard(struct sock *sk, struct tcp_sock *tp,
+				  u32 sack_entry_seq)
 {
 	if (!IsReno(tp))
-		tcp_update_scoreboard_fack(sk);
+		tcp_update_scoreboard_fack(sk, sack_entry_seq);
 	else
 		tcp_mark_head_lost_single(sk);
 }
@@ -2170,7 +2211,7 @@ static void tcp_mtup_probe_success(struc
  */
 static void
 tcp_fastretrans_alert(struct sock *sk, u32 prior_snd_una,
-		      int prior_packets, int flag)
+		      int prior_packets, int flag, u32 mark_lost_entry_seq)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -2198,7 +2239,7 @@ tcp_fastretrans_alert(struct sock *sk, u
 	    before(tp->snd_una, tp->high_seq) &&
 	    icsk->icsk_ca_state != TCP_CA_Open &&
 	    tp->fackets_out > tp->reordering) {
-	    	tcp_update_scoreboard_fack(sk);
+	    	tcp_update_scoreboard_fack(sk, mark_lost_entry_seq);
 		NET_INC_STATS_BH(LINUX_MIB_TCPLOSS);
 	}
 
@@ -2324,7 +2365,7 @@ tcp_fastretrans_alert(struct sock *sk, u
 	}
 
 	if (is_dupack || tcp_head_timedout(sk, tp))
-		tcp_update_scoreboard(sk, tp);
+		tcp_update_scoreboard(sk, tp, mark_lost_entry_seq);
 	tcp_cwnd_down(sk);
 	tcp_xmit_retransmit_queue(sk);
 }
@@ -2810,6 +2851,7 @@ static int tcp_ack(struct sock *sk, stru
 	u32 ack_seq = TCP_SKB_CB(skb)->seq;
 	u32 ack = TCP_SKB_CB(skb)->ack_seq;
 	u32 prior_in_flight;
+	u32 mark_lost_entry_seq = tp->snd_una;
 	s32 seq_rtt;
 	int prior_packets;
 	int frto_cwnd = 0;
@@ -2852,7 +2894,8 @@ static int tcp_ack(struct sock *sk, stru
 		flag |= tcp_ack_update_window(sk, tp, skb, ack, ack_seq);
 
 		if (TCP_SKB_CB(skb)->sacked)
-			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una);
+			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
+							&mark_lost_entry_seq);
 
 		if (TCP_ECN_rcv_ecn_echo(tp, tcp_hdr(skb)))
 			flag |= FLAG_ECE;
@@ -2882,7 +2925,8 @@ static int tcp_ack(struct sock *sk, stru
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd &&
 		    tcp_may_raise_cwnd(sk, flag))
 			tcp_cong_avoid(sk, ack,  seq_rtt, prior_in_flight, 0);
-		tcp_fastretrans_alert(sk, prior_snd_una, prior_packets, flag);
+		tcp_fastretrans_alert(sk, prior_snd_una, prior_packets, flag,
+				      mark_lost_entry_seq);
 	} else {
 		if ((flag & FLAG_DATA_ACKED) && !frto_cwnd)
 			tcp_cong_avoid(sk, ack, seq_rtt, prior_in_flight, 1);
@@ -2906,7 +2950,8 @@ no_queue:
 
 old_ack:
 	if (TCP_SKB_CB(skb)->sacked)
-		tcp_sacktag_write_queue(sk, skb, prior_snd_una);
+		tcp_sacktag_write_queue(sk, skb, prior_snd_una,
+					&mark_lost_entry_seq);
 
 uninteresting_ack:
 	SOCK_DEBUG(sk, "Ack %u out of %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
-- 
1.4.2



* [RFC PATCHv5 5/5] [TCP]: non-FACK SACK follows conservative SACK loss recovery (RFC3517)
  2007-03-16 18:41       ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations Ilpo Järvinen
@ 2007-03-16 18:41         ` Ilpo Järvinen
  2007-03-25  4:13           ` David Miller
  2007-03-25  4:10         ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations David Miller
  1 sibling, 1 reply; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-16 18:41 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Ilpo Järvinen

Many assumptions that hold when no reordering or other strange
events happen are not part of RFC3517; the FACK implementation is
based on such assumptions. Previously (before the rewrite), the
non-FACK SACK was basically doing a fast rexmit and then timing out
all skbs when the first cumulative ACK arrived, which cannot really
be called SACK-based recovery :-).

RFC3517 SACK disables these things:
- Per SKB timeouts & head timeout entry to recovery
- Marking at least one skb while in recovery (RFC3517 does this
  only for the fast retransmission but not for the other skbs
  when cumulative ACKs arrive in the recovery)
- B & C flavors of loss detection (see comment before
  tcp_sacktag_write_queue)

This does not implement the "last resort" rule 3 of NextSeg, which
allows retransmissions even when not enough SACK blocks have yet
arrived above a segment for IsLost to return true [RFC3517].

The implementation differs from RFC3517 in two points:
- Rate-halving is used instead of FlightSize / 2
- Instead of using dupACKs to trigger the recovery, the number
  of SACK blocks is used, as FACK does with SACK blocks+holes
  (which provides a more accurate number). It seems that the
  difference can have a negative effect only if the receiver
  does not generate SACK blocks at all even though it claimed
  to be SACK-capable.
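The key trigger difference described above (counting only SACKed segments vs. SACK blocks plus the holes between them) can be sketched outside the kernel roughly as follows. This is an illustrative standalone model, not the patched code; all names are hypothetical:

```c
#include <assert.h>

/* Illustrative model only (not kernel code): with RFC3517-style SACK,
 * a segment counts as lost once at least dup_thresh segments ABOVE it
 * have been SACKed; holes between SACKed segments do not count. */
static int is_lost_rfc3517(const int *sacked, int nseg, int idx, int dup_thresh)
{
	int sacked_above = 0;

	for (int i = idx + 1; i < nseg; i++)
		if (sacked[i])
			sacked_above++;
	return sacked_above >= dup_thresh;
}

/* FACK, by contrast, looks at the distance to the highest SACKed
 * segment, so SACKed segments and the holes between them both count. */
static int is_lost_fack(const int *sacked, int nseg, int idx, int dup_thresh)
{
	int highest_sack = -1;

	for (int i = nseg - 1; i >= 0; i--)
		if (sacked[i]) {
			highest_sack = i;
			break;
		}
	return highest_sack - idx >= dup_thresh;
}
```

With segments {., ., ., S, ., S} and a threshold of 3, FACK declares the first segment lost (distance to the highest SACK is 5) while the RFC3517 rule does not (only 2 SACKed segments lie above it), matching the more conservative behavior this patch gives non-FACK SACK.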

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |   59 +++++++++++++++++++++++++++++++-------------------
 1 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d34636b..a1b2e14 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -111,6 +111,7 @@ #define FLAG_FORWARD_PROGRESS	(FLAG_ACKE
 #define IsReno(tp) ((tp)->rx_opt.sack_ok == 0)
 #define IsFack(tp) ((tp)->rx_opt.sack_ok & 2)
 #define IsDSack(tp) ((tp)->rx_opt.sack_ok & 4)
+#define Is3517Sack(tp) (!IsReno(tp) && !IsFack(tp))
 
 #define IsSackFrto() (sysctl_tcp_frto == 0x2)
 
@@ -1235,7 +1236,7 @@ tcp_sacktag_write_queue(struct sock *sk,
 	 * we have to account for reordering! Ugly,
 	 * but should help.
 	 */
-	if (lost_retrans && icsk->icsk_ca_state == TCP_CA_Recovery) {
+	if (IsFack(tp) && lost_retrans && icsk->icsk_ca_state == TCP_CA_Recovery) {
 		struct sk_buff *skb;
 
 		tcp_for_write_queue(skb, sk) {
@@ -1554,7 +1555,8 @@ static int tcp_check_sack_reneging(struc
 
 static inline int tcp_fackets_out(struct tcp_sock *tp)
 {
-	return IsReno(tp) ? tp->sacked_out+1 : tp->fackets_out;
+	return (IsReno(tp) || Is3517Sack(tp)) ? tp->sacked_out + 1 :
+						tp->fackets_out;
 }
 
 static inline int tcp_skb_timedout(struct sock *sk, struct sk_buff *skb)
@@ -1680,7 +1682,7 @@ static int tcp_time_to_recover(struct so
 	/* Trick#3 : when we use RFC2988 timer restart, fast
 	 * retransmit can be triggered by timeout of queue head.
 	 */
-	if (tcp_head_timedout(sk, tp))
+	if (IsFack(tp) && tcp_head_timedout(sk, tp))
 		return 1;
 
 	/* Trick#4: It is still not OK... But will it be useful to delay
@@ -1829,24 +1831,33 @@ static void tcp_mark_head_lost_single(st
  *       entry point right above a known point provided from the backwards
  *       walk. Basically the entry point will be next skb after highest_sack
  *       or high_seq (if TCP did the skip).
+ *
+ * Implements both FACK and RFC3517 SACK. Only FACK does per skb timeouts.
+ * Key difference here is that FACK uses both SACK blocks and holes while
+ * RFC3517 is using only SACK blocks when counting for reordering.
  */
-static void tcp_update_scoreboard_fack(struct sock *sk, u32 entry_seq)
+static void tcp_update_scoreboard_fack(struct sock *sk, u32 entry_seq,
+				       int fast_rexmit)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int timedout_continue = 1;
+	int timedout_continue = 0;
 	/* Beware: not_marked_skb might be pointing to sk_write_queue */
 	struct sk_buff *not_marked_skb;
 	struct sk_buff *skb;
 
 	if (!tp->sacked_out) {
-		tcp_mark_head_lost_single(sk);
-		not_marked_skb = tcp_write_queue_head(sk);
-		not_marked_skb = tcp_write_queue_next(sk, not_marked_skb);
+		if (IsFack(tp) || fast_rexmit) {
+			tcp_mark_head_lost_single(sk);
+			not_marked_skb = tcp_write_queue_head(sk);
+			not_marked_skb = tcp_write_queue_next(sk, not_marked_skb);
+			timedout_continue = 1;
+		}
 
 	} else {
 		unsigned int holes_seen = 0;
 		int reentry_to_highest_sack = 0;
 
+		timedout_continue = 1;
 		skb = tcp_write_queue_find(sk, entry_seq);
 		/* If this ever becomes expensive, it can be delayed */
 		not_marked_skb = tcp_write_queue_next(sk, skb);
@@ -1857,7 +1868,7 @@ static void tcp_update_scoreboard_fack(s
 			/* Delay lookup because it might turn out unnecessary! */
 			reentry_to_highest_sack = 1;
 		} else {
-			unsigned int reord_count = 0;
+			unsigned int reord_count = IsFack(tp) ? 0 : 1;
 
 			/* Phase I: Search until TCP can mark */
 			tcp_for_write_queue_backwards_from(skb, sk) {
@@ -1867,7 +1878,7 @@ static void tcp_update_scoreboard_fack(s
 				    (TCP_SKB_CB(skb)->sacked & TCPCB_LOST))
 					goto backwards_walk_done;
 
-				if (tcp_skb_timedout(sk, skb))
+				if (IsFack(tp) && tcp_skb_timedout(sk, skb))
 					break;
 				else {
 					timedout_continue = 0;
@@ -1882,7 +1893,9 @@ static void tcp_update_scoreboard_fack(s
 				if (!(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 					not_marked_skb = skb;
 
-				reord_count += tcp_skb_pcount(skb);
+				if (IsFack(tp) ||
+				    (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
+					reord_count += tcp_skb_pcount(skb);
 				if (reord_count > tp->reordering)
 					break;
 
@@ -1891,7 +1904,7 @@ static void tcp_update_scoreboard_fack(s
 			}
 		}
 
-		if (!tcp_skb_timedout(sk, skb) &&
+		if ((!IsFack(tp) || !tcp_skb_timedout(sk, skb)) &&
 		    after(TCP_SKB_CB(skb)->seq, tp->high_seq)) {
 			/* RFC: should we have find_below? */
 			skb = tcp_write_queue_find(sk, tp->high_seq);
@@ -1918,11 +1931,11 @@ static void tcp_update_scoreboard_fack(s
 		}
 
 		/* Phase III: Nothing is still marked?, mark head then */
-		if (!tp->lost_out)
+		if ((IsFack(tp) || fast_rexmit) && !tp->lost_out)
 			tcp_mark_head_lost_single(sk);
 
 backwards_walk_done:
-		if (timedout_continue && reentry_to_highest_sack) {
+		if (IsFack(tp) && timedout_continue && reentry_to_highest_sack) {
 			/* ...do the delayed lookup */
 			skb = tcp_write_queue_find(sk, tp->highest_sack);
 			not_marked_skb = tcp_write_queue_next(sk, skb);
@@ -1930,7 +1943,7 @@ backwards_walk_done:
 	}
 
 	/* Continue with timedout work */
-	if (timedout_continue)
+	if (IsFack(tp) && timedout_continue)
 		tcp_timedout_mark_forward(sk, not_marked_skb);
 
 	tcp_sync_left_out(tp);
@@ -1938,10 +1951,10 @@ backwards_walk_done:
 
 /* Account newly detected lost packet(s) */
 static void tcp_update_scoreboard(struct sock *sk, struct tcp_sock *tp,
-				  u32 sack_entry_seq)
+				  u32 sack_entry_seq, int fast_rexmit)
 {
 	if (!IsReno(tp))
-		tcp_update_scoreboard_fack(sk, sack_entry_seq);
+		tcp_update_scoreboard_fack(sk, sack_entry_seq, fast_rexmit);
 	else
 		tcp_mark_head_lost_single(sk);
 }
@@ -2085,7 +2098,7 @@ static int tcp_try_undo_partial(struct s
 				int acked)
 {
 	/* Partial ACK arrived. Force Hoe's retransmit. */
-	int failed = IsReno(tp) || tp->fackets_out>tp->reordering;
+	int failed = IsReno(tp) || (tcp_fackets_out(tp) > tp->reordering);
 
 	if (tcp_may_undo(tp)) {
 		/* Plain luck! Hole if filled with delayed
@@ -2216,6 +2229,7 @@ tcp_fastretrans_alert(struct sock *sk, u
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int is_dupack = (tp->snd_una == prior_snd_una && !(flag&FLAG_NOT_DUP));
+	int fast_rexmit = 0;
 
 	/* Some technical things:
 	 * 1. Reno does not count dupacks (sacked_out) automatically. */
@@ -2235,11 +2249,11 @@ tcp_fastretrans_alert(struct sock *sk, u
 		return;
 
 	/* C. Process data loss notification, provided it is valid. */
-	if ((flag&FLAG_DATA_LOST) &&
+	if (IsFack(tp) && (flag&FLAG_DATA_LOST) &&
 	    before(tp->snd_una, tp->high_seq) &&
 	    icsk->icsk_ca_state != TCP_CA_Open &&
 	    tp->fackets_out > tp->reordering) {
-	    	tcp_update_scoreboard_fack(sk, mark_lost_entry_seq);
+	    	tcp_update_scoreboard_fack(sk, mark_lost_entry_seq, 0);
 		NET_INC_STATS_BH(LINUX_MIB_TCPLOSS);
 	}
 
@@ -2362,10 +2376,11 @@ tcp_fastretrans_alert(struct sock *sk, u
 		tp->bytes_acked = 0;
 		tp->snd_cwnd_cnt = 0;
 		tcp_set_ca_state(sk, TCP_CA_Recovery);
+		fast_rexmit = 1;
 	}
 
-	if (is_dupack || tcp_head_timedout(sk, tp))
-		tcp_update_scoreboard(sk, tp, mark_lost_entry_seq);
+	if (is_dupack || (IsFack(tp) && tcp_head_timedout(sk, tp)))
+		tcp_update_scoreboard(sk, tp, mark_lost_entry_seq, fast_rexmit);
 	tcp_cwnd_down(sk);
 	tcp_xmit_retransmit_queue(sk);
 }
-- 
1.4.2



* Re: [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK
  2007-03-16 18:41 ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK Ilpo Järvinen
  2007-03-16 18:41   ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code Ilpo Järvinen
@ 2007-03-25  4:04   ` David Miller
  2007-03-25 21:55     ` Ilpo Järvinen
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2007-03-25  4:04 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 16 Mar 2007 20:41:02 +0200

> It is guaranteed to be valid only when !tp->sacked_out. In most
> cases this seqno is available in the last ACK but there is no
> guarantee for that. The new fast recovery loss marking algorithm
> needs this as its entry point.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Patch applied, thanks Ilpo.

It's a shame we keep around multiple fields which represent very
related values.  For example, this new seqno could be computed
using "fackets_out" if we knew also how many holes there were.


* Re: [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code
  2007-03-16 18:41   ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code Ilpo Järvinen
  2007-03-16 18:41     ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions Ilpo Järvinen
@ 2007-03-25  4:05     ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2007-03-25  4:05 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 16 Mar 2007 20:41:03 +0200

> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Applied.


* Re: [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions
  2007-03-16 18:41     ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions Ilpo Järvinen
  2007-03-16 18:41       ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations Ilpo Järvinen
@ 2007-03-25  4:09       ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2007-03-25  4:09 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 16 Mar 2007 20:41:04 +0200

> Complete rewrite of update_scoreboard and mark_head_lost. A couple
> of hints became unnecessary because of this change, and the NewReno
> code was greatly simplified. I changed the !TCPCB_TAGBITS check from
> the original to a !(S|L) check, but it shouldn't make a difference,
> and if there ever is an R-only skb, TCP will mark it as LOST too,
> which is correct behavior (IMHO). The algorithm uses some ideas
> presented by David Miller and Baruch Even.
> 
> This patch provides an unoptimized version.
> 
> Seqno lookups must be fast; they are provided by the RB-tree
> patch (which is nowadays included in the latest net-2.6.22).
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Ok, sorry it took so long to review this, but it looks like it
will work, patch applied.

Thanks!


* Re: [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations
  2007-03-16 18:41       ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations Ilpo Järvinen
  2007-03-16 18:41         ` [RFC PATCHv5 5/5] [TCP]: non-FACK SACK follows conservative SACK loss recovery (RFC3517) Ilpo Järvinen
@ 2007-03-25  4:10         ` David Miller
  1 sibling, 0 replies; 14+ messages in thread
From: David Miller @ 2007-03-25  4:10 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 16 Mar 2007 20:41:05 +0200

> 1) Couple of skb states are mutually exclusive
> 
> An skb cannot be in both the S and L states at the same time;
> adding the count of skbs that are neither S nor L (below the
> highest SACK) to that can be compared against fackets_out to
> see whether anything below the current skb can still be marked
> with L or not. If they're equal (or fackets_out is smaller),
> the next skb will have L set for sure.
> 
> 2) Create a fastpath for the new LOST marker
> 
> The fastpath takes advantage of the fact that the latest ACK
> very likely contains the globally highest SACK block. By
> verifying that it is larger than tp->reordering, the whole SACK
> block can be skipped while counting not-to-be-marked skbs until
> tp->reordering of them is encountered. Since TCP can now know
> that such a block exists as the highest one, the marking can
> begin right below it. If the latest ACK does not contain a SACK
> block that reaches all the way to the highest_sack (e.g., due to
> reordering), a slow path is used.
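A rough standalone sketch of the fastpath decision quoted above (hypothetical names; the real code operates on the write queue, not bare seqnos):

```c
#include <assert.h>

typedef unsigned int u32;

/* Illustrative sketch: if the latest ACK's SACK block reaches the
 * globally highest SACKed seqno and spans at least `reordering`
 * segments, the not-to-be-marked counting can skip the whole block
 * and LOST marking may begin right below its start seqno; otherwise
 * fall back to entering the walk at the highest SACK (slow path). */
static u32 lost_mark_entry_seq(u32 sack_start, u32 sack_end,
			       u32 highest_sack, u32 reordering, u32 mss)
{
	if (sack_end == highest_sack &&
	    (sack_end - sack_start) / mss >= reordering)
		return sack_start;	/* fastpath: skip the whole block */
	return highest_sack;		/* slow path entry point */
}
```

For example, with mss 1000 and reordering 3, a SACK block 5000..9000 ending at the highest SACK lets marking enter at 5000, while a block ending short of highest_sack (reordered ACK) keeps the entry at highest_sack.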

Excellent observation. :)

> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Patch applied, thanks.


* Re: [RFC PATCHv5 5/5] [TCP]: non-FACK SACK follows conservative SACK loss recovery (RFC3517)
  2007-03-16 18:41         ` [RFC PATCHv5 5/5] [TCP]: non-FACK SACK follows conservative SACK loss recovery (RFC3517) Ilpo Järvinen
@ 2007-03-25  4:13           ` David Miller
  0 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2007-03-25  4:13 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 16 Mar 2007 20:41:06 +0200

> Many assumptions that hold when no reordering or other
> strange events happen are not part of RFC3517, yet the FACK
> implementation is based on such assumptions. Previously (before
> the rewrite) the non-FACK SACK code basically did a fast rexmit
> and then timed out all skbs when the first cumulative ACK arrived,
> which cannot really be called SACK-based recovery :-).
> 
> RFC3517 SACK disables these things:
> - Per SKB timeouts & head timeout entry to recovery
> - Marking at least one skb while in recovery (RFC3517 does this
>   only for the fast retransmission but not for the other skbs
>   when cumulative ACKs arrive in the recovery)
> - B & C flavors of loss detection (see comment before
>   tcp_sacktag_write_queue)
> 
> This does not implement the "last resort" rule 3 of NextSeg, which
> allows retransmissions also when not enough SACK blocks have yet
> arrived above a segment for IsLost to return true [RFC3517].
> 
> The implementation differs from RFC3517 in two points:
> - Rate-halving is used instead of FlightSize / 2
> - Instead of using dupACKs to trigger the recovery, the number
>   of SACK blocks is used, as FACK does with SACK blocks+holes
>   (which provides a more accurate number). It seems that the
>   difference can have a negative effect only if the receiver
>   does not generate SACK blocks at all even though it claimed
>   to be SACK-capable.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

I'm OK with this one too, applied, thanks a lot!


* Re: [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK
  2007-03-25  4:04   ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK David Miller
@ 2007-03-25 21:55     ` Ilpo Järvinen
  2007-03-26  1:35       ` David Miller
  0 siblings, 1 reply; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-25 21:55 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


On Sat, 24 Mar 2007, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Fri, 16 Mar 2007 20:41:02 +0200
> 
> > It is guaranteed to be valid only when !tp->sacked_out. In most
> > cases this seqno is available in the last ACK but there is no
> > guarantee for that. The new fast recovery loss marking algorithm
> > needs this as entry point.
> 
> It's a shame we keep around multiple values which represent very
> related values.  For example, this new seqno could be computed
> using "fackets_out" if we knew also how many holes there were.

It would then be an estimate which is 100% accurate in most cases (when 
all related skbs are full sized). I think it must never underestimate 
highest_sack, though; it would be quite OK to use fackets_out * mss, but 
any larger-than-mss skb would obviously be a problem (if they ever 
occur)... Are all paths OK with that (this is beyond the codepaths I'm 
quite sure of...)?!
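For illustration, the arithmetic under discussion might look like this (a toy model with hypothetical names and numbers, not code from the thread):

```c
#include <assert.h>

typedef unsigned int u32;

/* Toy model of the estimate discussed above: approximate the highest
 * SACKed seqno as snd_una + fackets_out * mss. This is exact exactly
 * when every counted skb carries mss bytes. */
static u32 estimate_highest_sack(u32 snd_una, u32 fackets_out, u32 mss)
{
	return snd_una + fackets_out * mss;
}

/* The true value walks the actual per-skb lengths; one skb larger
 * than mss makes the estimate above an underestimate, which is the
 * problematic case for the LOST marking entry point. */
static u32 actual_highest_sack(u32 snd_una, const u32 *skb_len, u32 n)
{
	u32 seq = snd_una;

	for (u32 i = 0; i < n; i++)
		seq += skb_len[i];
	return seq;
}
```

With mss 1460 and three skbs of {1460, 1460, 2920} bytes, the estimate gives 5380 while the true value is 6840: the estimate falls short precisely when a larger-than-mss skb occurs.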

-- 
 i.


* Re: [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK
  2007-03-25 21:55     ` Ilpo Järvinen
@ 2007-03-26  1:35       ` David Miller
  2007-03-26 21:43         ` Ilpo Järvinen
  0 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2007-03-26  1:35 UTC (permalink / raw)
  To: ilpo.jarvinen; +Cc: netdev

From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Mon, 26 Mar 2007 00:55:57 +0300 (EEST)

> On Sat, 24 Mar 2007, David Miller wrote:
> 
> > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> > Date: Fri, 16 Mar 2007 20:41:02 +0200
> > 
> > > It is guaranteed to be valid only when !tp->sacked_out. In most
> > > cases this seqno is available in the last ACK but there is no
> > > guarantee for that. The new fast recovery loss marking algorithm
> > > needs this as entry point.
> > 
> > It's a shame we keep around multiple values which represent very
> > related values.  For example, this new seqno could be computed
> > using "fackets_out" if we knew also how many holes there were.
> 
> It would then be an estimate which is 100% accurate in most cases (when 
> all related skbs are full sized). I think it must never underestimate 
> highest_sack though, it would be quite ok to use fackets_out * mss but any 
> > mss skb would obviously be a problem (if they ever occur)... Are all 
> paths ok with that (this is beyond the codepaths I'm quite sure of...)?!

Right, I understand that we really can't even use such an estimate
because of the "full mss size" assumption.

In any event, maybe we can find another way to "integrate" other
values we maintain in tcp_sock that can be computed from other
ones cheaply.


* Re: [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK
  2007-03-26  1:35       ` David Miller
@ 2007-03-26 21:43         ` Ilpo Järvinen
  0 siblings, 0 replies; 14+ messages in thread
From: Ilpo Järvinen @ 2007-03-26 21:43 UTC (permalink / raw)
  To: David Miller; +Cc: netdev


On Sun, 25 Mar 2007, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Mon, 26 Mar 2007 00:55:57 +0300 (EEST)
> 
> > On Sat, 24 Mar 2007, David Miller wrote:
> > 
> > > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> > >
> > > > It is guaranteed to be valid only when !tp->sacked_out. In most
> > > > cases this seqno is available in the last ACK but there is no
> > > > guarantee for that. The new fast recovery loss marking algorithm
> > > > needs this as entry point.
> > > 
> > > It's a shame we keep around multiple values which represent very
> > > related values.  For example, this new seqno could be computed
> > > using "fackets_out" if we knew also how many holes there were.
> > 
> > It would then be an estimate which is 100% accurate in most cases (when 
> > all related skbs are full sized). I think it must never underestimate 
> > highest_sack though, it would be quite ok to use fackets_out * mss but
> > any > mss skb would obviously be a problem (if they ever occur)... 
> > Are all paths ok with that (this is beyond the codepaths I'm quite 
> > sure of...)?!

I was being overly broad here; of course these happen with GSO, but the 
point was that if an skb's pcount * current mss < size of the skb, using 
fackets_out to estimate the highest sack is problematic... At least path 
MTU discovery comes to mind as a potentially problematic case since it 
plays with segment sizing, but as I said, I'm not sure it is bad after 
all...

> Right, I understand that we really can't even use such an estimate
> because of the "full mss size" assumption.

Are you sure of this "can't"?

I think the LOST marking algorithm can safely be used if highest_sack is 
never underestimated... As long as the search begins at, or at a point 
after, the real highest_sack, the highest SACK block can be located by 
walking until the first SACKED_ACKED is found or TCP encounters a LOST 
skb (due to an earlier timeout).

Overestimating just means some extra backwards walking in the queue (and 
a couple of lines of code :-))... Of course, care should be taken that 
the walk is never started from a sequence number that is after 
send_head... Also, if I understand your queue_find code correctly, in 
the case where send_head points to nothing (TCP cannot check its seqno 
:-)) and we overestimated, the find would return NULL. In that case, 
instead of doing queue_find, the correct thing would be to start from 
the last skb, I think...
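Sketched over a plain array instead of the kernel write queue, the walk being described might look like this (an illustration with hypothetical names, not the actual code):

```c
#include <assert.h>

enum { SACKED_ACKED = 1, LOST = 2 };	/* simplified per-segment flags */

/* Illustrative sketch of the walk above: start from a (possibly
 * overestimated) entry index, clamp a past-the-end index to the last
 * segment (the "start from the last skb" case when the seqno lookup
 * would return NULL), then walk backwards to the first segment that is
 * SACKED_ACKED or already LOST (due to an earlier timeout). */
static int find_highest_marked(const unsigned char *flags, int nseg, int entry)
{
	if (entry >= nseg)
		entry = nseg - 1;	/* overestimate past send_head */
	for (int i = entry; i >= 0; i--)
		if (flags[i] & (SACKED_ACKED | LOST))
			return i;
	return -1;			/* nothing marked below entry point */
}
```

Overestimating the entry only lengthens the backwards walk; the result is the same segment either way, which is why never underestimating is the critical property.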


-- 
 i.


end of thread, other threads:[~2007-03-26 21:43 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-03-16 18:41 [RFC PATCHv5 0/5] LOST marking rewrite Ilpo Järvinen
2007-03-16 18:41 ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK Ilpo Järvinen
2007-03-16 18:41   ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code Ilpo Järvinen
2007-03-16 18:41     ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions Ilpo Järvinen
2007-03-16 18:41       ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations Ilpo Järvinen
2007-03-16 18:41         ` [RFC PATCHv5 5/5] [TCP]: non-FACK SACK follows conservative SACK loss recovery (RFC3517) Ilpo Järvinen
2007-03-25  4:13           ` David Miller
2007-03-25  4:10         ` [RFC PATCHv5 4/5] [TCP]: new LOST marker optimizations David Miller
2007-03-25  4:09       ` [RFC PATCHv5 3/5] [TCP]: Reworked recovery's TCPCB_LOST marking functions David Miller
2007-03-25  4:05     ` [RFC PATCHv5 2/5] [TCP]: Extracted rexmit hint clearing from the LOST marking code David Miller
2007-03-25  4:04   ` [RFC PATCHv5 1/5] [TCP]: Add highest_sack seqno, points to globally highest SACK David Miller
2007-03-25 21:55     ` Ilpo Järvinen
2007-03-26  1:35       ` David Miller
2007-03-26 21:43         ` Ilpo Järvinen
