* [RFC PATCH net-2.6.25 0/10] [TCP]: Cleanups, tweaks & sacktag recode
@ 2007-11-15 13:37 Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 01/10] [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s Ilpo Järvinen
0 siblings, 1 reply; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Hi Dave,
Here's the sacktag recv_sack_cache usage rewrite you were interested
in looking at earlier; due to other fixes it has dragged on this
long... Besides that, a couple of new bugs^W^Wcleanups & tweaks are
included as well :-). I'll probably have to bring the
create-tcp_sacktag_state patch back to avoid all that pointer
passing all around, but those won't be earth-shattering changes.
The first two are probably trivial enough to be accepted as is.
Boot & simple transfer tested, with minor fixes after that. I'll try
to find time at some point to do more verification of the new
sacktag and RFC3517 code, and to compare the old and new sacktag to
get some numbers on accessed skbs per sacktag operation.
--
i.
* [PATCH 01/10] [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s
2007-11-15 13:37 [RFC PATCH net-2.6.25 0/10] [TCP]: Cleanups, tweaks & sacktag recode Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially Ilpo Järvinen
2007-11-16 3:33 ` [PATCH 01/10] [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s David Miller
0 siblings, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
All the intermediate conditions already include it, so they can be
simplified as well.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/ipv4/tcp_input.c | 31 ++++++++++++++-----------------
1 files changed, 14 insertions(+), 17 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 0f0c1c9..c470b5a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1406,28 +1406,25 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
if (unlikely(in_sack < 0))
break;
+ if (!in_sack) {
+ fack_count += tcp_skb_pcount(skb);
+ continue;
+ }
+
sacked = TCP_SKB_CB(skb)->sacked;
/* Account D-SACK for retransmitted packet. */
- if ((dup_sack && in_sack) &&
- (sacked & TCPCB_RETRANS) &&
- after(TCP_SKB_CB(skb)->end_seq, tp->undo_marker))
- tp->undo_retrans--;
-
- /* The frame is ACKed. */
- if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) {
- if (sacked&TCPCB_RETRANS) {
- if ((dup_sack && in_sack) &&
- (sacked&TCPCB_SACKED_ACKED))
- reord = min(fack_count, reord);
- }
-
- /* Nothing to do; acked frame is about to be dropped. */
- fack_count += tcp_skb_pcount(skb);
- continue;
+ if (dup_sack && (sacked & TCPCB_RETRANS)) {
+ if (after(TCP_SKB_CB(skb)->end_seq, tp->undo_marker))
+ tp->undo_retrans--;
+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una) &&
+ (sacked & TCPCB_SACKED_ACKED))
+ reord = min(fack_count, reord);
}
- if (!in_sack) {
+
+ /* Nothing to do; acked frame is about to be dropped (was ACKed). */
+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) {
fack_count += tcp_skb_pcount(skb);
continue;
}
--
1.5.0.6
* [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially
2007-11-15 13:37 ` [PATCH 01/10] [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery Ilpo Järvinen
2007-11-16 3:35 ` [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially David Miller
2007-11-16 3:33 ` [PATCH 01/10] [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
This implements more accurately what is stated in sacktag's
overall comment:
"Both of these heuristics are not used in Loss state, when
we cannot account for retransmits accurately."
When the CA_Loss state is entered, the state changer ensures that
undo_marker is only set if no TCPCB_RETRANS skbs were found; thus a
non-zero undo_marker in CA_Loss basically tells us that the R-bits
still accurately reflect the current state of TCP.
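For reference, the undo_marker behaviour this relies on lives in
tcp_enter_loss(); roughly (a paraphrased sketch of the existing
logic as I understand it, not part of this patch):

	/* Arm undo_marker only for a plain RTO, and disarm it again
	 * as soon as any skb in the write queue carries
	 * TCPCB_RETRANS, i.e. whenever the R-bits can no longer be
	 * trusted.
	 */
	if (!how)
		tp->undo_marker = tp->snd_una;

	tcp_for_write_queue(skb, sk) {
		if (skb == tcp_send_head(sk))
			break;
		if (TCP_SKB_CB(skb)->sacked & TCPCB_RETRANS)
			tp->undo_marker = 0;
		/* ... marking of skbs as lost omitted ... */
	}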
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/ipv4/tcp_input.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c470b5a..48c059d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1511,7 +1511,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
tcp_verify_left_out(tp);
- if ((reord < tp->fackets_out) && icsk->icsk_ca_state != TCP_CA_Loss &&
+ if ((reord < tp->fackets_out) &&
+ ((icsk->icsk_ca_state != TCP_CA_Loss) || tp->undo_marker) &&
(!tp->frto_highmark || after(tp->snd_una, tp->frto_highmark)))
tcp_update_reordering(sk, tp->fackets_out - reord, 0);
--
1.5.0.6
* [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery
2007-11-15 13:37 ` [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access Ilpo Järvinen
2007-11-16 3:41 ` [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery David Miller
2007-11-16 3:35 ` [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Many assumptions that hold when no reordering or other strange
events happen are not part of RFC3517, and the FACK implementation
is based on such assumptions. Previously (before this rewrite) the
non-FACK SACK code basically did a fast rexmit and then timed out
all skbs when the first cumulative ACK arrived, which cannot really
be called SACK-based recovery :-).
RFC3517 SACK disables these things:
- Per SKB timeouts & head timeout entry to recovery
- Marking at least one skb while in recovery (RFC3517 does this
only for the fast retransmission but not for the other skbs
when cumulative ACKs arrive in the recovery)
- Sacktag's loss detection flavors B and C (see comment before
tcp_sacktag_write_queue)
This does not implement the "last resort" rule 3 of NextSeg, which
allows retransmissions also when not enough SACK blocks have yet
arrived above a segment for IsLost to return true [RFC3517].
The implementation differs from RFC3517 in these points:
- Rate-halving is used instead of FlightSize / 2
- Instead of using dupACKs to trigger the recovery, the number
of SACK blocks is used, as FACK does with SACK blocks+holes
(which provides a more accurate number). It seems that the
difference can have a negative effect only if the receiver does
not generate SACK blocks at all even though it claimed to be
SACK-capable.
- Dupthresh is not a constant. Dynamic adjustments include
both holes and sacked segments (equal to what FACK has) due to
the complexity involved in determining the number of sacked
blocks between highest_sack and the reordered segment. Thus it
will be an over-estimate.
Implementation note:
tcp_clean_rtx_queue doesn't need a lost_cnt tweak because the head
skb at that point cannot be SACKED_ACKED (nor would such a
situation last long enough to cause problems).
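To illustrate how much more conservative the RFC3517 flavor is, a
small standalone sketch follows (illustration only; the numbers and
helper names are made up, the formulas mirror the
tcp_update_scoreboard() change in this patch but ignore the
fast_rexmit special case):

	#include <stdio.h>

	/* FACK: trust the highest SACK and mark everything beyond
	 * the reordering window as lost. */
	static int fack_lost(int fackets_out, int reordering)
	{
		int lost = fackets_out - reordering;
		return lost <= 0 ? 1 : lost;
	}

	/* RFC3517: only mark what the count of SACKed segments
	 * itself justifies. */
	static int rfc3517_lost(int sacked_out, int reordering)
	{
		int sacked_upto = sacked_out - reordering;
		return sacked_upto < 0 ? 0 : sacked_upto;
	}

	int main(void)
	{
		int reordering = 3;
		int sacked_out = 4;	/* four segments SACKed...       */
		int holes = 6;		/* ...spread over six holes      */
		int fackets_out = sacked_out + holes;

		/* prints: FACK marks up to 7, RFC3517 marks up to 1 */
		printf("FACK marks up to %d, RFC3517 marks up to %d\n",
		       fack_lost(fackets_out, reordering),
		       rfc3517_lost(sacked_out, reordering));
		return 0;
	}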
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/ipv4/tcp_input.c | 80 ++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 62 insertions(+), 18 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 48c059d..c1b5339 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -863,6 +863,9 @@ void tcp_enter_cwr(struct sock *sk, const int set_ssthresh)
*/
static void tcp_disable_fack(struct tcp_sock *tp)
{
+ /* RFC3517 uses different metric in lost marker => reset on change */
+ if (tcp_is_fack(tp))
+ tp->lost_skb_hint = NULL;
tp->rx_opt.sack_ok &= ~2;
}
@@ -1470,6 +1473,13 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
tp->sacked_out += tcp_skb_pcount(skb);
fack_count += tcp_skb_pcount(skb);
+
+ /* Lost marker hint past SACKed? Tweak RFC3517 cnt */
+ if (!tcp_is_fack(tp) && (tp->lost_skb_hint != NULL) &&
+ before(TCP_SKB_CB(skb)->seq,
+ TCP_SKB_CB(tp->lost_skb_hint)->seq))
+ tp->lost_cnt_hint += tcp_skb_pcount(skb);
+
if (fack_count > tp->fackets_out)
tp->fackets_out = fack_count;
@@ -1504,7 +1514,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
flag &= ~FLAG_ONLY_ORIG_SACKED;
}
- if (tp->retrans_out &&
+ if (tcp_is_fack(tp) && tp->retrans_out &&
after(highest_sack_end_seq, tp->lost_retrans_low) &&
icsk->icsk_ca_state == TCP_CA_Recovery)
flag |= tcp_mark_lost_retrans(sk, highest_sack_end_seq);
@@ -1858,6 +1868,26 @@ static inline int tcp_fackets_out(struct tcp_sock *tp)
return tcp_is_reno(tp) ? tp->sacked_out+1 : tp->fackets_out;
}
+/* Heurestics to calculate number of duplicate ACKs. There's no dupACKs
+ * counter when SACK is enabled (without SACK, sacked_out is used for
+ * that purpose).
+ *
+ * Instead, with FACK TCP uses fackets_out that includes both SACKed
+ * segments up to the highest received SACK block so far and holes in
+ * between them.
+ *
+ * With reordering, holes may still be in flight, so RFC3517 recovery
+ * uses pure sacked_out (total number of SACKed segments) even though
+ * it violates the RFC that uses duplicate ACKs, often these are equal
+ * but when e.g. out-of-window ACKs or packet duplication occurs,
+ * they differ. Since neither occurs due to loss, TCP should really
+ * ignore them.
+ */
+static inline int tcp_dupack_heurestics(struct tcp_sock *tp)
+{
+ return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
+}
+
static inline int tcp_skb_timedout(struct sock *sk, struct sk_buff *skb)
{
return (tcp_time_stamp - TCP_SKB_CB(skb)->when > inet_csk(sk)->icsk_rto);
@@ -1978,13 +2008,13 @@ static int tcp_time_to_recover(struct sock *sk)
return 1;
/* Not-A-Trick#2 : Classic rule... */
- if (tcp_fackets_out(tp) > tp->reordering)
+ if (tcp_dupack_heurestics(tp) > tp->reordering)
return 1;
/* Trick#3 : when we use RFC2988 timer restart, fast
* retransmit can be triggered by timeout of queue head.
*/
- if (tcp_head_timedout(sk))
+ if (tcp_is_fack(tp) && tcp_head_timedout(sk))
return 1;
/* Trick#4: It is still not OK... But will it be useful to delay
@@ -2017,8 +2047,10 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp,
tp->retransmit_skb_hint = NULL;
}
-/* Mark head of queue up as lost. */
-static void tcp_mark_head_lost(struct sock *sk, int packets)
+/* Mark head of queue up as lost. With RFC3517 SACK, the packets is
+ * is against sacked "cnt", otherwise it's against facked "cnt"
+ */
+static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
{
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
@@ -2040,8 +2072,13 @@ static void tcp_mark_head_lost(struct sock *sk, int packets)
/* this is not the most efficient way to do this... */
tp->lost_skb_hint = skb;
tp->lost_cnt_hint = cnt;
- cnt += tcp_skb_pcount(skb);
- if (cnt > packets || after(TCP_SKB_CB(skb)->end_seq, tp->high_seq))
+
+ if (tcp_is_fack(tp) ||
+ (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
+ cnt += tcp_skb_pcount(skb);
+
+ if (((!fast_rexmit || (tp->lost_out > 0)) && (cnt > packets)) ||
+ after(TCP_SKB_CB(skb)->end_seq, tp->high_seq))
break;
if (!(TCP_SKB_CB(skb)->sacked & (TCPCB_SACKED_ACKED|TCPCB_LOST))) {
TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
@@ -2054,17 +2091,22 @@ static void tcp_mark_head_lost(struct sock *sk, int packets)
/* Account newly detected lost packet(s) */
-static void tcp_update_scoreboard(struct sock *sk)
+static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
{
struct tcp_sock *tp = tcp_sk(sk);
- if (tcp_is_fack(tp)) {
+ if (tcp_is_reno(tp)) {
+ tcp_mark_head_lost(sk, 1, fast_rexmit);
+ } else if (tcp_is_fack(tp)) {
int lost = tp->fackets_out - tp->reordering;
if (lost <= 0)
lost = 1;
- tcp_mark_head_lost(sk, lost);
+ tcp_mark_head_lost(sk, lost, fast_rexmit);
} else {
- tcp_mark_head_lost(sk, 1);
+ int sacked_upto = tp->sacked_out - tp->reordering;
+ if (sacked_upto < 0)
+ sacked_upto = 0;
+ tcp_mark_head_lost(sk, sacked_upto, fast_rexmit);
}
/* New heuristics: it is possible only after we switched
@@ -2072,7 +2114,7 @@ static void tcp_update_scoreboard(struct sock *sk)
* Hence, we can detect timed out packets during fast
* retransmit without falling to slow start.
*/
- if (!tcp_is_reno(tp) && tcp_head_timedout(sk)) {
+ if (tcp_is_fack(tp) && tcp_head_timedout(sk)) {
struct sk_buff *skb;
skb = tp->scoreboard_skb_hint ? tp->scoreboard_skb_hint
@@ -2245,7 +2287,7 @@ static int tcp_try_undo_partial(struct sock *sk, int acked)
{
struct tcp_sock *tp = tcp_sk(sk);
/* Partial ACK arrived. Force Hoe's retransmit. */
- int failed = tcp_is_reno(tp) || tp->fackets_out>tp->reordering;
+ int failed = tcp_is_reno(tp) || (tcp_fackets_out(tp) > tp->reordering);
if (tcp_may_undo(tp)) {
/* Plain luck! Hole if filled with delayed
@@ -2379,7 +2421,8 @@ tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
struct tcp_sock *tp = tcp_sk(sk);
int is_dupack = !(flag&(FLAG_SND_UNA_ADVANCED|FLAG_NOT_DUP));
int do_lost = is_dupack || ((flag&FLAG_DATA_SACKED) &&
- (tp->fackets_out > tp->reordering));
+ (tcp_fackets_out(tp) > tp->reordering));
+ int fast_rexmit = 0;
/* Some technical things:
* 1. Reno does not count dupacks (sacked_out) automatically. */
@@ -2399,11 +2442,11 @@ tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
return;
/* C. Process data loss notification, provided it is valid. */
- if ((flag&FLAG_DATA_LOST) &&
+ if (tcp_is_fack(tp) && (flag & FLAG_DATA_LOST) &&
before(tp->snd_una, tp->high_seq) &&
icsk->icsk_ca_state != TCP_CA_Open &&
tp->fackets_out > tp->reordering) {
- tcp_mark_head_lost(sk, tp->fackets_out - tp->reordering);
+ tcp_mark_head_lost(sk, tp->fackets_out-tp->reordering, 0);
NET_INC_STATS_BH(LINUX_MIB_TCPLOSS);
}
@@ -2522,10 +2565,11 @@ tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
tp->bytes_acked = 0;
tp->snd_cwnd_cnt = 0;
tcp_set_ca_state(sk, TCP_CA_Recovery);
+ fast_rexmit = 1;
}
- if (do_lost || tcp_head_timedout(sk))
- tcp_update_scoreboard(sk);
+ if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
+ tcp_update_scoreboard(sk, fast_rexmit);
tcp_cwnd_down(sk, flag);
tcp_xmit_retransmit_queue(sk);
}
--
1.5.0.6
* [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access
2007-11-15 13:37 ` [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained Ilpo Järvinen
2007-11-16 3:42 ` [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access David Miller
2007-11-16 3:41 ` [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
It is going to replace the sack fastpath hint quite soon... :-)
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
include/linux/tcp.h | 6 ++++--
include/net/tcp.h | 10 ++++++++++
net/ipv4/tcp_input.c | 12 ++++++------
net/ipv4/tcp_output.c | 19 ++++++++++---------
4 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index bac17c5..34acee6 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -332,8 +332,10 @@ struct tcp_sock {
struct tcp_sack_block_wire recv_sack_cache[4];
- u32 highest_sack; /* Start seq of globally highest revd SACK
- * (validity guaranteed only if sacked_out > 0) */
+ struct sk_buff *highest_sack; /* highest skb with SACK received
+ * (validity guaranteed only if
+ * sacked_out > 0)
+ */
/* from STCP, retrans queue hinting */
struct sk_buff* lost_skb_hint;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d695cea..3444647 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1306,6 +1306,16 @@ static inline int tcp_write_queue_empty(struct sock *sk)
return skb_queue_empty(&sk->sk_write_queue);
}
+/* Start sequence of the highest skb with SACKed bit, valid only if
+ * sacked > 0 or when the caller has ensured validity by itself.
+ */
+static inline u32 tcp_highest_sack_seq(struct tcp_sock *tp)
+{
+ if (!tp->sacked_out)
+ return tp->snd_una;
+ return TCP_SKB_CB(tp->highest_sack)->seq;
+}
+
/* /proc */
enum tcp_seq_states {
TCP_SEQ_STATE_LISTENING,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ef8187b..c25704f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1245,7 +1245,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
int reord = tp->packets_out;
int prior_fackets;
- u32 highest_sack_end_seq = tp->lost_retrans_low;
+ u32 highest_sack_end_seq;
int flag = 0;
int found_dup_sack = 0;
int cached_fack_count;
@@ -1256,7 +1256,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
if (!tp->sacked_out) {
if (WARN_ON(tp->fackets_out))
tp->fackets_out = 0;
- tp->highest_sack = tp->snd_una;
+ tp->highest_sack = tcp_write_queue_head(sk);
}
prior_fackets = tp->fackets_out;
@@ -1483,10 +1483,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
if (fack_count > tp->fackets_out)
tp->fackets_out = fack_count;
- if (after(TCP_SKB_CB(skb)->seq, tp->highest_sack)) {
- tp->highest_sack = TCP_SKB_CB(skb)->seq;
- highest_sack_end_seq = TCP_SKB_CB(skb)->end_seq;
- }
+ if (after(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp)))
+ tp->highest_sack = skb;
+
} else {
if (dup_sack && (sacked&TCPCB_RETRANS))
reord = min(fack_count, reord);
@@ -1514,6 +1513,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
flag &= ~FLAG_ONLY_ORIG_SACKED;
}
+ highest_sack_end_seq = TCP_SKB_CB(tp->highest_sack)->end_seq;
if (tcp_is_fack(tp) && tp->retrans_out &&
after(highest_sack_end_seq, tp->lost_retrans_low) &&
icsk->icsk_ca_state == TCP_CA_Recovery)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 324b420..a5863f9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -657,13 +657,15 @@ static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned
* tweak SACK fastpath hint too as it would overwrite all changes unless
* hint is also changed.
*/
-static void tcp_adjust_fackets_out(struct tcp_sock *tp, struct sk_buff *skb,
+static void tcp_adjust_fackets_out(struct sock *sk, struct sk_buff *skb,
int decr)
{
+ struct tcp_sock *tp = tcp_sk(sk);
+
if (!tp->sacked_out || tcp_is_reno(tp))
return;
- if (!before(tp->highest_sack, TCP_SKB_CB(skb)->seq))
+ if (!before(tcp_highest_sack_seq(tp), TCP_SKB_CB(skb)->seq))
tp->fackets_out -= decr;
/* cnt_hint is "off-by-one" compared with fackets_out (see sacktag) */
@@ -712,9 +714,8 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss
TCP_SKB_CB(buff)->end_seq = TCP_SKB_CB(skb)->end_seq;
TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(buff)->seq;
- if (tcp_is_sack(tp) && tp->sacked_out &&
- (TCP_SKB_CB(skb)->seq == tp->highest_sack))
- tp->highest_sack = TCP_SKB_CB(buff)->seq;
+ if (tcp_is_sack(tp) && tp->sacked_out && (skb == tp->highest_sack))
+ tp->highest_sack = buff;
/* PSH and FIN should only be set in the second packet. */
flags = TCP_SKB_CB(skb)->flags;
@@ -772,7 +773,7 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss
tcp_dec_pcount_approx_int(&tp->sacked_out, diff);
tcp_verify_left_out(tp);
}
- tcp_adjust_fackets_out(tp, skb, diff);
+ tcp_adjust_fackets_out(sk, skb, diff);
}
/* Link BUFF into the send queue. */
@@ -1720,7 +1721,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m
tcp_skb_pcount(next_skb) != 1);
if (WARN_ON(tcp_is_sack(tp) && tp->sacked_out &&
- (TCP_SKB_CB(next_skb)->seq == tp->highest_sack)))
+ (next_skb == tp->highest_sack)))
return;
/* Ok. We will be able to collapse the packet. */
@@ -1755,7 +1756,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m
if (tcp_is_reno(tp) && tp->sacked_out)
tcp_dec_pcount_approx(&tp->sacked_out, next_skb);
- tcp_adjust_fackets_out(tp, next_skb, tcp_skb_pcount(next_skb));
+ tcp_adjust_fackets_out(sk, next_skb, tcp_skb_pcount(next_skb));
tp->packets_out -= tcp_skb_pcount(next_skb);
/* changed transmit queue under us so clear hints */
@@ -2036,7 +2037,7 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
break;
tp->forward_skb_hint = skb;
- if (after(TCP_SKB_CB(skb)->seq, tp->highest_sack))
+ if (after(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp)))
break;
if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
--
1.5.0.6
* [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained
2007-11-15 13:37 ` [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq Ilpo Järvinen
2007-11-16 3:43 ` [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained David Miller
2007-11-16 3:42 ` [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Highest_sack_end_seq is no longer calculated in the sacktag loop,
so its computation can be pushed into the worker function
altogether, making that function independent of sacktag.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/ipv4/tcp_input.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c25704f..b7af304 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1115,16 +1115,23 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
*
* Search retransmitted skbs from write_queue that were sent when snd_nxt was
* less than what is now known to be received by the other end (derived from
- * SACK blocks by the caller). Also calculate the lowest snd_nxt among the
- * remaining retransmitted skbs to avoid some costly processing per ACKs.
+ * highest SACK block). Also calculate the lowest snd_nxt among the remaining
+ * retransmitted skbs to avoid some costly processing per ACKs.
*/
-static int tcp_mark_lost_retrans(struct sock *sk, u32 received_upto)
+static int tcp_mark_lost_retrans(struct sock *sk)
{
+ const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
int flag = 0;
int cnt = 0;
u32 new_low_seq = tp->snd_nxt;
+ u32 received_upto = TCP_SKB_CB(tp->highest_sack)->end_seq;
+
+ if (!tcp_is_fack(tp) || !tp->retrans_out ||
+ !after(received_upto, tp->lost_retrans_low) ||
+ icsk->icsk_ca_state != TCP_CA_Recovery)
+ return flag;
tcp_for_write_queue(skb, sk) {
u32 ack_seq = TCP_SKB_CB(skb)->ack_seq;
@@ -1245,7 +1252,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
int reord = tp->packets_out;
int prior_fackets;
- u32 highest_sack_end_seq;
int flag = 0;
int found_dup_sack = 0;
int cached_fack_count;
@@ -1513,11 +1519,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
flag &= ~FLAG_ONLY_ORIG_SACKED;
}
- highest_sack_end_seq = TCP_SKB_CB(tp->highest_sack)->end_seq;
- if (tcp_is_fack(tp) && tp->retrans_out &&
- after(highest_sack_end_seq, tp->lost_retrans_low) &&
- icsk->icsk_ca_state == TCP_CA_Recovery)
- flag |= tcp_mark_lost_retrans(sk, highest_sack_end_seq);
+ flag |= tcp_mark_lost_retrans(sk);
tcp_verify_left_out(tp);
--
1.5.0.6
* [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq
2007-11-15 13:37 ` [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 07/10] [TCP]: Create tcp_sacktag_one() Ilpo Järvinen
2007-11-16 3:44 ` [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq David Miller
2007-11-16 3:43 ` [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/ipv4/tcp_input.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b7af304..29fff81 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1251,7 +1251,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
struct sk_buff *cached_skb;
int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
int reord = tp->packets_out;
- int prior_fackets;
int flag = 0;
int found_dup_sack = 0;
int cached_fack_count;
@@ -1264,7 +1263,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
tp->fackets_out = 0;
tp->highest_sack = tcp_write_queue_head(sk);
}
- prior_fackets = tp->fackets_out;
found_dup_sack = tcp_check_dsack(tp, ack_skb, sp,
num_sacks, prior_snd_una);
@@ -1457,7 +1455,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
/* New sack for not retransmitted frame,
* which was in hole. It is reordering.
*/
- if (fack_count < prior_fackets)
+ if (before(TCP_SKB_CB(skb)->seq,
+ tcp_highest_sack_seq(tp)))
reord = min(fack_count, reord);
/* SACK enhanced F-RTO (RFC4138; Appendix B) */
--
1.5.0.6
* [PATCH 07/10] [TCP]: Create tcp_sacktag_one().
2007-11-15 13:37 ` [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them Ilpo Järvinen
2007-11-16 3:45 ` [PATCH 07/10] [TCP]: Create tcp_sacktag_one() David Miller
2007-11-16 3:44 ` [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Add a worker function that implements the main logic of the
innermost loop of tcp_sacktag_write_queue(). The idea was
originally presented by David S. Miller.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
net/ipv4/tcp_input.c | 192 +++++++++++++++++++++++++-------------------------
1 files changed, 96 insertions(+), 96 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 29fff81..b301abb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1240,6 +1240,99 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
return in_sack;
}
+static int tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp,
+ int *reord, int dup_sack, int fack_count)
+{
+ u8 sacked = TCP_SKB_CB(skb)->sacked;
+ int flag = 0;
+
+ /* Account D-SACK for retransmitted packet. */
+ if (dup_sack && (sacked & TCPCB_RETRANS)) {
+ if (after(TCP_SKB_CB(skb)->end_seq, tp->undo_marker))
+ tp->undo_retrans--;
+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una) &&
+ (sacked & TCPCB_SACKED_ACKED))
+ *reord = min(fack_count, *reord);
+ }
+
+ /* Nothing to do; acked frame is about to be dropped (was ACKed). */
+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
+ return flag;
+
+ if (!(sacked & TCPCB_SACKED_ACKED)) {
+ if (sacked & TCPCB_SACKED_RETRANS) {
+ /* If the segment is not tagged as lost,
+ * we do not clear RETRANS, believing
+ * that retransmission is still in flight.
+ */
+ if (sacked & TCPCB_LOST) {
+ TCP_SKB_CB(skb)->sacked &=
+ ~(TCPCB_LOST|TCPCB_SACKED_RETRANS);
+ tp->lost_out -= tcp_skb_pcount(skb);
+ tp->retrans_out -= tcp_skb_pcount(skb);
+
+ /* clear lost hint */
+ tp->retransmit_skb_hint = NULL;
+ }
+ } else {
+ if (!(sacked & TCPCB_RETRANS)) {
+ /* New sack for not retransmitted frame,
+ * which was in hole. It is reordering.
+ */
+ if (before(TCP_SKB_CB(skb)->seq,
+ tcp_highest_sack_seq(tp)))
+ *reord = min(fack_count, *reord);
+
+ /* SACK enhanced F-RTO (RFC4138; Appendix B) */
+ if (!after(TCP_SKB_CB(skb)->end_seq, tp->frto_highmark))
+ flag |= FLAG_ONLY_ORIG_SACKED;
+ }
+
+ if (sacked & TCPCB_LOST) {
+ TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
+ tp->lost_out -= tcp_skb_pcount(skb);
+
+ /* clear lost hint */
+ tp->retransmit_skb_hint = NULL;
+ }
+ }
+
+ TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED;
+ flag |= FLAG_DATA_SACKED;
+ tp->sacked_out += tcp_skb_pcount(skb);
+
+ fack_count += tcp_skb_pcount(skb);
+
+ /* Lost marker hint past SACKed? Tweak RFC3517 cnt */
+ if (!tcp_is_fack(tp) && (tp->lost_skb_hint != NULL) &&
+ before(TCP_SKB_CB(skb)->seq,
+ TCP_SKB_CB(tp->lost_skb_hint)->seq))
+ tp->lost_cnt_hint += tcp_skb_pcount(skb);
+
+ if (fack_count > tp->fackets_out)
+ tp->fackets_out = fack_count;
+
+ if (after(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp)))
+ tp->highest_sack = skb;
+
+ } else {
+ if (dup_sack && (sacked & TCPCB_RETRANS))
+ *reord = min(fack_count, *reord);
+ }
+
+ /* D-SACK. We can detect redundant retransmission in S|R and plain R
+ * frames and clear it. undo_retrans is decreased above, L|R frames
+ * are accounted above as well.
+ */
+ if (dup_sack && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)) {
+ TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
+ tp->retrans_out -= tcp_skb_pcount(skb);
+ tp->retransmit_skb_hint = NULL;
+ }
+
+ return flag;
+}
+
static int
tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_una)
{
@@ -1375,7 +1468,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
tcp_for_write_queue_from(skb, sk) {
int in_sack = 0;
- u8 sacked;
if (skb == tcp_send_head(sk))
break;
@@ -1413,102 +1505,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
if (unlikely(in_sack < 0))
break;
- if (!in_sack) {
- fack_count += tcp_skb_pcount(skb);
- continue;
- }
-
- sacked = TCP_SKB_CB(skb)->sacked;
-
- /* Account D-SACK for retransmitted packet. */
- if (dup_sack && (sacked & TCPCB_RETRANS)) {
- if (after(TCP_SKB_CB(skb)->end_seq, tp->undo_marker))
- tp->undo_retrans--;
- if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una) &&
- (sacked & TCPCB_SACKED_ACKED))
- reord = min(fack_count, reord);
- }
-
-
- /* Nothing to do; acked frame is about to be dropped (was ACKed). */
- if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) {
- fack_count += tcp_skb_pcount(skb);
- continue;
- }
-
- if (!(sacked&TCPCB_SACKED_ACKED)) {
- if (sacked & TCPCB_SACKED_RETRANS) {
- /* If the segment is not tagged as lost,
- * we do not clear RETRANS, believing
- * that retransmission is still in flight.
- */
- if (sacked & TCPCB_LOST) {
- TCP_SKB_CB(skb)->sacked &= ~(TCPCB_LOST|TCPCB_SACKED_RETRANS);
- tp->lost_out -= tcp_skb_pcount(skb);
- tp->retrans_out -= tcp_skb_pcount(skb);
-
- /* clear lost hint */
- tp->retransmit_skb_hint = NULL;
- }
- } else {
- if (!(sacked & TCPCB_RETRANS)) {
- /* New sack for not retransmitted frame,
- * which was in hole. It is reordering.
- */
- if (before(TCP_SKB_CB(skb)->seq,
- tcp_highest_sack_seq(tp)))
- reord = min(fack_count, reord);
-
- /* SACK enhanced F-RTO (RFC4138; Appendix B) */
- if (!after(TCP_SKB_CB(skb)->end_seq, tp->frto_highmark))
- flag |= FLAG_ONLY_ORIG_SACKED;
- }
-
- if (sacked & TCPCB_LOST) {
- TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST;
- tp->lost_out -= tcp_skb_pcount(skb);
-
- /* clear lost hint */
- tp->retransmit_skb_hint = NULL;
- }
- }
-
- TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED;
- flag |= FLAG_DATA_SACKED;
- tp->sacked_out += tcp_skb_pcount(skb);
-
- fack_count += tcp_skb_pcount(skb);
-
- /* Lost marker hint past SACKed? Tweak RFC3517 cnt */
- if (!tcp_is_fack(tp) && (tp->lost_skb_hint != NULL) &&
- before(TCP_SKB_CB(skb)->seq,
- TCP_SKB_CB(tp->lost_skb_hint)->seq))
- tp->lost_cnt_hint += tcp_skb_pcount(skb);
-
- if (fack_count > tp->fackets_out)
- tp->fackets_out = fack_count;
+ if (in_sack)
+ flag |= tcp_sacktag_one(skb, tp, &reord, dup_sack, fack_count);
- if (after(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp)))
- tp->highest_sack = skb;
-
- } else {
- if (dup_sack && (sacked&TCPCB_RETRANS))
- reord = min(fack_count, reord);
-
- fack_count += tcp_skb_pcount(skb);
- }
-
- /* D-SACK. We can detect redundant retransmission
- * in S|R and plain R frames and clear it.
- * undo_retrans is decreased above, L|R frames
- * are accounted above as well.
- */
- if (dup_sack &&
- (TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_RETRANS)) {
- TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
- tp->retrans_out -= tcp_skb_pcount(skb);
- tp->retransmit_skb_hint = NULL;
- }
+ fack_count += tcp_skb_pcount(skb);
}
/* SACK enhanced FRTO (RFC4138, Appendix B): Clearing correct
--
1.5.0.6
* [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them
2007-11-15 13:37 ` [PATCH 07/10] [TCP]: Create tcp_sacktag_one() Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use Ilpo Järvinen
2007-11-16 3:50 ` [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them David Miller
2007-11-16 3:45 ` [PATCH 07/10] [TCP]: Create tcp_sacktag_one() David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
include/linux/tcp.h | 2 +-
net/ipv4/tcp_input.c | 85 ++++++++++++++++++++++++++++++--------------------
2 files changed, 52 insertions(+), 35 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 34acee6..794497c 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -330,7 +330,7 @@ struct tcp_sock {
struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
struct tcp_sack_block selective_acks[4]; /* The SACKS themselves*/
- struct tcp_sack_block_wire recv_sack_cache[4];
+ struct tcp_sack_block recv_sack_cache[4];
struct sk_buff *highest_sack; /* highest skb with SACK received
* (validity guaranteed only if
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b301abb..69f2f79 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1340,9 +1340,11 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
struct tcp_sock *tp = tcp_sk(sk);
unsigned char *ptr = (skb_transport_header(ack_skb) +
TCP_SKB_CB(ack_skb)->sacked);
- struct tcp_sack_block_wire *sp = (struct tcp_sack_block_wire *)(ptr+2);
+ struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
+ struct tcp_sack_block sp[4];
struct sk_buff *cached_skb;
int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
+ int used_sacks;
int reord = tp->packets_out;
int flag = 0;
int found_dup_sack = 0;
@@ -1357,7 +1359,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
tp->highest_sack = tcp_write_queue_head(sk);
}
- found_dup_sack = tcp_check_dsack(tp, ack_skb, sp,
+ found_dup_sack = tcp_check_dsack(tp, ack_skb, sp_wire,
num_sacks, prior_snd_una);
if (found_dup_sack)
flag |= FLAG_DSACKING_ACK;
@@ -1372,14 +1374,49 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
if (!tp->packets_out)
goto out;
+ used_sacks = 0;
+ first_sack_index = 0;
+ for (i = 0; i < num_sacks; i++) {
+ int dup_sack = !i && found_dup_sack;
+
+ sp[used_sacks].start_seq = ntohl(get_unaligned(&sp_wire[i].start_seq));
+ sp[used_sacks].end_seq = ntohl(get_unaligned(&sp_wire[i].end_seq));
+
+ if (!tcp_is_sackblock_valid(tp, dup_sack,
+ sp[used_sacks].start_seq,
+ sp[used_sacks].end_seq)) {
+ if (dup_sack) {
+ if (!tp->undo_marker)
+ NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO);
+ else
+ NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD);
+ } else {
+ /* Don't count olds caused by ACK reordering */
+ if ((TCP_SKB_CB(ack_skb)->ack_seq != tp->snd_una) &&
+ !after(sp[used_sacks].end_seq, tp->snd_una))
+ continue;
+ NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD);
+ }
+ if (i == 0)
+ first_sack_index = -1;
+ continue;
+ }
+
+ /* Ignore very old stuff early */
+ if (!after(sp[used_sacks].end_seq, prior_snd_una))
+ continue;
+
+ used_sacks++;
+ }
+
/* SACK fastpath:
* if the only SACK change is the increase of the end_seq of
* the first block then only apply that SACK block
* and use retrans queue hinting otherwise slowpath */
force_one_sack = 1;
- for (i = 0; i < num_sacks; i++) {
- __be32 start_seq = sp[i].start_seq;
- __be32 end_seq = sp[i].end_seq;
+ for (i = 0; i < used_sacks; i++) {
+ u32 start_seq = sp[i].start_seq;
+ u32 end_seq = sp[i].end_seq;
if (i == 0) {
if (tp->recv_sack_cache[i].start_seq != start_seq)
@@ -1398,19 +1435,17 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
tp->recv_sack_cache[i].end_seq = 0;
}
- first_sack_index = 0;
if (force_one_sack)
- num_sacks = 1;
+ used_sacks = 1;
else {
int j;
tp->fastpath_skb_hint = NULL;
/* order SACK blocks to allow in order walk of the retrans queue */
- for (i = num_sacks-1; i > 0; i--) {
+ for (i = used_sacks - 1; i > 0; i--) {
for (j = 0; j < i; j++){
- if (after(ntohl(sp[j].start_seq),
- ntohl(sp[j+1].start_seq))){
- struct tcp_sack_block_wire tmp;
+ if (after(sp[j].start_seq, sp[j+1].start_seq)) {
+ struct tcp_sack_block tmp;
tmp = sp[j];
sp[j] = sp[j+1];
@@ -1433,32 +1468,14 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
cached_fack_count = 0;
}
- for (i = 0; i < num_sacks; i++) {
+ for (i = 0; i < used_sacks; i++) {
struct sk_buff *skb;
- __u32 start_seq = ntohl(sp->start_seq);
- __u32 end_seq = ntohl(sp->end_seq);
+ u32 start_seq = sp[i].start_seq;
+ u32 end_seq = sp[i].end_seq;
int fack_count;
int dup_sack = (found_dup_sack && (i == first_sack_index));
int next_dup = (found_dup_sack && (i+1 == first_sack_index));
- sp++;
-
- if (!tcp_is_sackblock_valid(tp, dup_sack, start_seq, end_seq)) {
- if (dup_sack) {
- if (!tp->undo_marker)
- NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO);
- else
- NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD);
- } else {
- /* Don't count olds caused by ACK reordering */
- if ((TCP_SKB_CB(ack_skb)->ack_seq != tp->snd_una) &&
- !after(end_seq, tp->snd_una))
- continue;
- NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD);
- }
- continue;
- }
-
skb = cached_skb;
fack_count = cached_fack_count;
@@ -1489,8 +1506,8 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
/* Due to sorting DSACK may reside within this SACK block! */
if (next_dup) {
- u32 dup_start = ntohl(sp->start_seq);
- u32 dup_end = ntohl(sp->end_seq);
+ u32 dup_start = sp[i+1].start_seq;
+ u32 dup_end = sp[i+1].end_seq;
if (before(TCP_SKB_CB(skb)->seq, dup_end)) {
in_sack = tcp_match_skb_to_sack(sk, skb, dup_start, dup_end);
--
1.5.0.6
* [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use
2007-11-15 13:37 ` [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 10/10] [TCP]: Track sacktag (DEVEL PATCH) Ilpo Järvinen
2007-11-16 4:00 ` [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use David Miller
2007-11-16 3:50 ` [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them David Miller
1 sibling, 2 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
Key points of this patch are:
- If the new SACK information is advance-only, no skb processing
is done below the previously discovered highest point
- Cases below the highest point are optimized too, since there is
no need to always walk up to the highest point (which is very
likely still present in that SACK); this is not entirely true,
though, because I'm dropping the fastpath_skb_hint which could
previously optimize those cases even better. Whether that's
significant, I'm not too sure.
Currently it provides skipping by walking. Combined with an
RB-tree, all skipping would become fast regardless of window size
(that can be done incrementally later).
Previously a number of cases in TCP SACK processing failed to take
advantage of the costly information stored in sack_recv_cache,
most importantly for expected events such as cumulative ACKs and
new hole ACKs. Processing such ACKs resulted in rather long walks,
building up latencies (which easily get nasty when the window is
huge). Those latencies are often completely unnecessary compared
with the amount of _new_ information received; usually for a
cumulative ACK there is no new information at all, yet TCP walks
the whole queue unnecessarily, potentially taking a number of
costly cache misses on the way, etc.!
Since the inclusion of highest_sack, there is a lot of information
that is very likely redundant (the SACK fastpath hint stuff,
fackets_out, highest_sack), though there is no ultimate guarantee
that they'll remain the same the whole time (in all unearthly
scenarios). Take advantage of this knowledge here: drop the
fastpath hint and use direct access to the highest SACKed skb as
a replacement.
Effectively the "special cased" fastpath is dropped. This change
adds some complexity in order to introduce a "fastpath" with
better coverage, though the added complexity should make TCP
behave in a more cache-friendly way.
The current ACK's SACK blocks are compared against each cached
block individually, and only the ranges that are new are then
scanned by the high-constant-cost walk. For other parts of the
write queue, even when inside a previously known part of the SACK
blocks, a faster skip function is used (if necessary at all). In
addition, whenever possible, TCP fast-forwards to the highest_sack
skb that was made available by an earlier patch. In the typical
case nothing but this fast-forward and the mandatory markings
after it occurs, making the access pattern quite similar to the
former fastpath "special case".
DSACKs are a special case that must always be walked.
The copying from the local blocks to recv_sack_cache could be more
intelligent w.r.t. DSACKs, which are likely to be there only once,
but that is left to a separate patch.
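As a rough illustration of the new-range detection against the
cache, a simplified standalone sketch (not the kernel code; it
ignores sequence-number wraparound and the DSACK/highest_sack
handling described above):

	#include <stdio.h>
	#include <stdint.h>

	struct block { uint32_t start, end; };

	/* Print the sequence ranges of the current SACK blocks (sp)
	 * that are not already covered by the cached blocks from the
	 * previous ACK, i.e. the only parts the full walk must
	 * visit. */
	static void ranges_to_walk(const struct block *cache, int ncache,
				   const struct block *sp, int nsp)
	{
		int c = 0;

		for (int i = 0; i < nsp; i++) {
			/* skip cached blocks entirely below this one */
			while (c < ncache && cache[c].end <= sp[i].start)
				c++;

			if (c < ncache && cache[c].start < sp[i].end) {
				if (sp[i].start < cache[c].start)
					printf("walk %u-%u (new head)\n",
					       sp[i].start, cache[c].start);
				if (sp[i].end > cache[c].end)
					printf("walk %u-%u (new tail)\n",
					       cache[c].end, sp[i].end);
			} else {
				printf("walk %u-%u (no cache overlap)\n",
				       sp[i].start, sp[i].end);
			}
		}
	}

	int main(void)
	{
		struct block cache[] = { { 1000, 2000 }, { 3000, 4000 } };
		struct block now[]   = { { 1000, 2500 }, { 3000, 4000 } };

		/* prints only: walk 2000-2500 (new tail) */
		ranges_to_walk(cache, 2, now, 2);
		return 0;
	}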
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
include/linux/tcp.h | 3 -
include/net/tcp.h | 1 -
net/ipv4/tcp_input.c | 277 +++++++++++++++++++++++++++++++------------------
net/ipv4/tcp_output.c | 14 +---
4 files changed, 175 insertions(+), 120 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 794497c..08027f1 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -343,10 +343,7 @@ struct tcp_sock {
struct sk_buff *scoreboard_skb_hint;
struct sk_buff *retransmit_skb_hint;
struct sk_buff *forward_skb_hint;
- struct sk_buff *fastpath_skb_hint;
- int fastpath_cnt_hint; /* Lags behind by current skb's pcount
- * compared to respective fackets_out */
int lost_cnt_hint;
int retransmit_cnt_hint;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 3444647..0844261 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1078,7 +1078,6 @@ static inline void tcp_clear_retrans_hints_partial(struct tcp_sock *tp)
static inline void tcp_clear_all_retrans_hints(struct tcp_sock *tp)
{
tcp_clear_retrans_hints_partial(tp);
- tp->fastpath_skb_hint = NULL;
}
/* MD5 Signature */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 69f2f79..5833b01 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1333,6 +1333,88 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct tcp_sock *tp,
return flag;
}
+static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
+ struct tcp_sack_block *next_dup,
+ u32 start_seq, u32 end_seq,
+ int dup_sack_in, int *fack_count,
+ int *reord, int *flag)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+
+ tcp_for_write_queue_from(skb, sk) {
+ int in_sack = 0;
+ int dup_sack = dup_sack_in;
+
+ if (skb == tcp_send_head(sk))
+ break;
+
+ /* queue is in-order => we can short-circuit the walk early */
+ if (!before(TCP_SKB_CB(skb)->seq, end_seq))
+ break;
+
+ if ((next_dup != NULL) &&
+ before(TCP_SKB_CB(skb)->seq, next_dup->end_seq)) {
+ in_sack = tcp_match_skb_to_sack(sk, skb,
+ next_dup->start_seq,
+ next_dup->end_seq);
+ if (in_sack > 0)
+ dup_sack = 1;
+ }
+
+ if (in_sack <= 0)
+ in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq);
+ if (unlikely(in_sack < 0))
+ break;
+
+ if (in_sack)
+ *flag |= tcp_sacktag_one(skb, tp, reord, dup_sack, *fack_count);
+
+ *fack_count += tcp_skb_pcount(skb);
+ }
+ return skb;
+}
+
+/* Avoid all extra work that is being done by sacktag while walking in
+ * a normal way
+ */
+static struct sk_buff *tcp_sacktag_skip(struct sk_buff *skb, struct sock *sk,
+ u32 skip_to_seq)
+{
+ tcp_for_write_queue_from(skb, sk) {
+ if (skb == tcp_send_head(sk))
+ break;
+
+ if (before(TCP_SKB_CB(skb)->end_seq, skip_to_seq))
+ break;
+ }
+ return skb;
+}
+
+static struct sk_buff *tcp_maybe_skipping_dsack(struct sk_buff *skb,
+ struct sock *sk,
+ struct tcp_sack_block *next_dup,
+ u32 skip_to_seq,
+ int *fack_count, int *reord,
+ int *flag)
+{
+ if (next_dup == NULL)
+ return skb;
+
+ if (before(next_dup->start_seq, skip_to_seq)) {
+ skb = tcp_sacktag_skip(skb, sk, next_dup->start_seq);
+ tcp_sacktag_walk(skb, sk, NULL,
+ next_dup->start_seq, next_dup->end_seq,
+ 1, fack_count, reord, flag);
+ }
+
+ return skb;
+}
+
+static int tcp_sack_cache_ok(struct tcp_sock *tp, struct tcp_sack_block *cache)
+{
+ return cache < tp->recv_sack_cache + ARRAY_SIZE(tp->recv_sack_cache);
+}
+
static int
tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_una)
{
@@ -1342,16 +1424,16 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
TCP_SKB_CB(ack_skb)->sacked);
struct tcp_sack_block_wire *sp_wire = (struct tcp_sack_block_wire *)(ptr+2);
struct tcp_sack_block sp[4];
- struct sk_buff *cached_skb;
+ struct tcp_sack_block *cache;
+ struct sk_buff *skb;
int num_sacks = (ptr[1] - TCPOLEN_SACK_BASE)>>3;
int used_sacks;
int reord = tp->packets_out;
int flag = 0;
int found_dup_sack = 0;
- int cached_fack_count;
- int i;
+ int fack_count;
+ int i, j;
int first_sack_index;
- int force_one_sack;
if (!tp->sacked_out) {
if (WARN_ON(tp->fackets_out))
@@ -1409,132 +1491,123 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
used_sacks++;
}
- /* SACK fastpath:
- * if the only SACK change is the increase of the end_seq of
- * the first block then only apply that SACK block
- * and use retrans queue hinting otherwise slowpath */
- force_one_sack = 1;
- for (i = 0; i < used_sacks; i++) {
- u32 start_seq = sp[i].start_seq;
- u32 end_seq = sp[i].end_seq;
-
- if (i == 0) {
- if (tp->recv_sack_cache[i].start_seq != start_seq)
- force_one_sack = 0;
- } else {
- if ((tp->recv_sack_cache[i].start_seq != start_seq) ||
- (tp->recv_sack_cache[i].end_seq != end_seq))
- force_one_sack = 0;
- }
- tp->recv_sack_cache[i].start_seq = start_seq;
- tp->recv_sack_cache[i].end_seq = end_seq;
- }
- /* Clear the rest of the cache sack blocks so they won't match mistakenly. */
- for (; i < ARRAY_SIZE(tp->recv_sack_cache); i++) {
- tp->recv_sack_cache[i].start_seq = 0;
- tp->recv_sack_cache[i].end_seq = 0;
- }
-
- if (force_one_sack)
- used_sacks = 1;
- else {
- int j;
- tp->fastpath_skb_hint = NULL;
-
- /* order SACK blocks to allow in order walk of the retrans queue */
- for (i = used_sacks - 1; i > 0; i--) {
- for (j = 0; j < i; j++){
- if (after(sp[j].start_seq, sp[j+1].start_seq)) {
- struct tcp_sack_block tmp;
-
- tmp = sp[j];
- sp[j] = sp[j+1];
- sp[j+1] = tmp;
-
- /* Track where the first SACK block goes to */
- if (j == first_sack_index)
- first_sack_index = j+1;
- }
-
+ /* order SACK blocks to allow in order walk of the retrans queue */
+ for (i = used_sacks - 1; i > 0; i--) {
+ for (j = 0; j < i; j++){
+ if (after(sp[j].start_seq, sp[j+1].start_seq)) {
+ struct tcp_sack_block tmp;
+
+ tmp = sp[j];
+ sp[j] = sp[j+1];
+ sp[j+1] = tmp;
+
+ /* Track where the first SACK block goes to */
+ if (j == first_sack_index)
+ first_sack_index = j+1;
}
}
}
- /* Use SACK fastpath hint if valid */
- cached_skb = tp->fastpath_skb_hint;
- cached_fack_count = tp->fastpath_cnt_hint;
- if (!cached_skb) {
- cached_skb = tcp_write_queue_head(sk);
- cached_fack_count = 0;
+ skb = tcp_write_queue_head(sk);
+ fack_count = 0;
+ i = 0;
+
+ if (!tp->sacked_out) {
+ /* It's already past, so skip checking against it */
+ cache = tp->recv_sack_cache + ARRAY_SIZE(tp->recv_sack_cache);
+ } else {
+ cache = tp->recv_sack_cache;
+ /* Skip empty blocks in at head of the cache */
+ while (tcp_sack_cache_ok(tp, cache) && !cache->start_seq &&
+ !cache->end_seq)
+ cache++;
}
- for (i = 0; i < used_sacks; i++) {
- struct sk_buff *skb;
+ while (i < used_sacks) {
u32 start_seq = sp[i].start_seq;
u32 end_seq = sp[i].end_seq;
- int fack_count;
int dup_sack = (found_dup_sack && (i == first_sack_index));
- int next_dup = (found_dup_sack && (i+1 == first_sack_index));
-
- skb = cached_skb;
- fack_count = cached_fack_count;
+ struct tcp_sack_block *next_dup = NULL;
+
+ if (found_dup_sack && ((i + 1) == first_sack_index))
+ next_dup = &sp[i + 1];
/* Event "B" in the comment above. */
if (after(end_seq, tp->high_seq))
flag |= FLAG_DATA_LOST;
- tcp_for_write_queue_from(skb, sk) {
- int in_sack = 0;
-
- if (skb == tcp_send_head(sk))
- break;
-
- cached_skb = skb;
- cached_fack_count = fack_count;
- if (i == first_sack_index) {
- tp->fastpath_skb_hint = skb;
- tp->fastpath_cnt_hint = fack_count;
+ /* Skip too early cached blocks */
+ while (tcp_sack_cache_ok(tp, cache) &&
+ !before(start_seq, cache->end_seq))
+ cache++;
+
+ /* Can skip some work by looking recv_sack_cache? */
+ if (tcp_sack_cache_ok(tp, cache) && !dup_sack &&
+ after(end_seq, cache->start_seq)) {
+
+ /* Head todo? */
+ if (before(start_seq, cache->start_seq)) {
+ skb = tcp_sacktag_skip(skb, sk, start_seq);
+ skb = tcp_sacktag_walk(skb, sk, next_dup, start_seq,
+ cache->start_seq, dup_sack,
+ &fack_count, &reord, &flag);
}
- /* The retransmission queue is always in order, so
- * we can short-circuit the walk early.
- */
- if (!before(TCP_SKB_CB(skb)->seq, end_seq))
- break;
-
- dup_sack = (found_dup_sack && (i == first_sack_index));
+ /* Rest of the block already fully processed? */
+ if (!after(end_seq, cache->end_seq)) {
+ skb = tcp_maybe_skipping_dsack(skb, sk, next_dup, cache->end_seq,
+ &fack_count, &reord, &flag);
+ goto advance_sp;
+ }
- /* Due to sorting DSACK may reside within this SACK block! */
- if (next_dup) {
- u32 dup_start = sp[i+1].start_seq;
- u32 dup_end = sp[i+1].end_seq;
+ /* ...tail remains todo... */
+ if (TCP_SKB_CB(tp->highest_sack)->end_seq == cache->end_seq) {
+ /* ...but better entrypoint exists! Check that DSACKs are
+ * properly accounted while skipping here
+ */
+ tcp_maybe_skipping_dsack(skb, sk, next_dup, cache->end_seq,
+ &fack_count, &reord, &flag);
- if (before(TCP_SKB_CB(skb)->seq, dup_end)) {
- in_sack = tcp_match_skb_to_sack(sk, skb, dup_start, dup_end);
- if (in_sack > 0)
- dup_sack = 1;
- }
+ skb = tcp_write_queue_next(sk, tp->highest_sack);
+ fack_count = tp->fackets_out;
+ cache++;
+ goto walk;
}
- /* DSACK info lost if out-of-mem, try SACK still */
- if (in_sack <= 0)
- in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq);
- if (unlikely(in_sack < 0))
- break;
-
- if (in_sack)
- flag |= tcp_sacktag_one(skb, tp, &reord, dup_sack, fack_count);
+ skb = tcp_sacktag_skip(skb, sk, cache->end_seq);
+ /* Check overlap against next cached too (past this one already) */
+ cache++;
+ continue;
+ }
- fack_count += tcp_skb_pcount(skb);
+ if (!before(start_seq, tcp_highest_sack_seq(tp))) {
+ skb = tcp_write_queue_next(sk, tp->highest_sack);
+ fack_count = tp->fackets_out;
}
+ skb = tcp_sacktag_skip(skb, sk, start_seq);
+
+walk:
+ skb = tcp_sacktag_walk(skb, sk, next_dup, start_seq, end_seq,
+ dup_sack, &fack_count, &reord, &flag);
+advance_sp:
/* SACK enhanced FRTO (RFC4138, Appendix B): Clearing correct
* due to in-order walk
*/
if (after(end_seq, tp->frto_highmark))
flag &= ~FLAG_ONLY_ORIG_SACKED;
+
+ i++;
}
+ /* Clear the head of the cache sack blocks so we can skip it next time */
+ for (i = 0; i < ARRAY_SIZE(tp->recv_sack_cache) - used_sacks; i++) {
+ tp->recv_sack_cache[i].start_seq = 0;
+ tp->recv_sack_cache[i].end_seq = 0;
+ }
+ for (j = 0; j < used_sacks; j++)
+ tp->recv_sack_cache[i++] = sp[j];
+
flag |= tcp_mark_lost_retrans(sk);
tcp_verify_left_out(tp);
@@ -2818,9 +2891,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p,
}
tp->fackets_out -= min(pkts_acked, tp->fackets_out);
- /* hint's skb might be NULL but we don't need to care */
- tp->fastpath_cnt_hint -= min_t(u32, pkts_acked,
- tp->fastpath_cnt_hint);
+
if (ca_ops->pkts_acked) {
s32 rtt_us = -1;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a5863f9..b0ece43 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -653,9 +653,7 @@ static void tcp_set_skb_tso_segs(struct sock *sk, struct sk_buff *skb, unsigned
}
/* When a modification to fackets out becomes necessary, we need to check
- * skb is counted to fackets_out or not. Another important thing is to
- * tweak SACK fastpath hint too as it would overwrite all changes unless
- * hint is also changed.
+ * skb is counted to fackets_out or not.
*/
static void tcp_adjust_fackets_out(struct sock *sk, struct sk_buff *skb,
int decr)
@@ -667,11 +665,6 @@ static void tcp_adjust_fackets_out(struct sock *sk, struct sk_buff *skb,
if (!before(tcp_highest_sack_seq(tp), TCP_SKB_CB(skb)->seq))
tp->fackets_out -= decr;
-
- /* cnt_hint is "off-by-one" compared with fackets_out (see sacktag) */
- if (tp->fastpath_skb_hint != NULL &&
- after(TCP_SKB_CB(tp->fastpath_skb_hint)->seq, TCP_SKB_CB(skb)->seq))
- tp->fastpath_cnt_hint -= decr;
}
/* Function to create two new TCP segments. Shrinks the given segment
@@ -1761,11 +1754,6 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m
/* changed transmit queue under us so clear hints */
tcp_clear_retrans_hints_partial(tp);
- /* manually tune sacktag skb hint */
- if (tp->fastpath_skb_hint == next_skb) {
- tp->fastpath_skb_hint = skb;
- tp->fastpath_cnt_hint -= tcp_skb_pcount(skb);
- }
sk_stream_free_skb(sk, next_skb);
}
--
1.5.0.6
* [PATCH 10/10] [TCP]: Track sacktag (DEVEL PATCH)
2007-11-15 13:37 ` [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use Ilpo Järvinen
@ 2007-11-15 13:37 ` Ilpo Järvinen
2007-11-16 4:00 ` [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use David Miller
1 sibling, 0 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-15 13:37 UTC (permalink / raw)
To: David Miller; +Cc: netdev
This is not intended to go to mainline; it is provided just for
those interested enough in the algorithm internals during a test.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
include/linux/snmp.h | 19 +++++++++++++++++++
net/ipv4/proc.c | 19 +++++++++++++++++++
net/ipv4/tcp_input.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 86 insertions(+), 2 deletions(-)
diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index 89f0c2b..fbcd62d 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -214,6 +214,25 @@ enum
LINUX_MIB_TCPDSACKIGNOREDOLD, /* TCPSACKIgnoredOld */
LINUX_MIB_TCPDSACKIGNOREDNOUNDO, /* TCPSACKIgnoredNoUndo */
LINUX_MIB_TCPSPURIOUSRTOS, /* TCPSpuriousRTOs */
+ LINUX_MIB_TCP_SACK0,
+ LINUX_MIB_TCP_SACK1,
+ LINUX_MIB_TCP_SACK2,
+ LINUX_MIB_TCP_SACK3,
+ LINUX_MIB_TCP_SACK4,
+ LINUX_MIB_TCP_WALKEDSKBS,
+ LINUX_MIB_TCP_WALKEDDSACKS,
+ LINUX_MIB_TCP_SKIPPEDSKBS,
+ LINUX_MIB_TCP_NOCACHE,
+ LINUX_MIB_TCP_HEADWALK,
+ LINUX_MIB_TCP_FULLSKIP,
+ LINUX_MIB_TCP_TAILSKIP,
+ LINUX_MIB_TCP_HEADSKIP_TOHIGH,
+ LINUX_MIB_TCP_TAIL_TOHIGH,
+ LINUX_MIB_TCP_HEADSKIP,
+ LINUX_MIB_TCP_NEWSKIP,
+ LINUX_MIB_TCP_FULLWALK,
+ LINUX_MIB_TCP_TAILWALK,
+ LINUX_MIB_TCP_CACHEREMAINING,
__LINUX_MIB_MAX
};
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index ce34b28..a5e842d 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -227,6 +227,25 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPDSACKIgnoredOld", LINUX_MIB_TCPDSACKIGNOREDOLD),
SNMP_MIB_ITEM("TCPDSACKIgnoredNoUndo", LINUX_MIB_TCPDSACKIGNOREDNOUNDO),
SNMP_MIB_ITEM("TCPSpuriousRTOs", LINUX_MIB_TCPSPURIOUSRTOS),
+ SNMP_MIB_ITEM("TCP_SACK0", LINUX_MIB_TCP_SACK0),
+ SNMP_MIB_ITEM("TCP_SACK1", LINUX_MIB_TCP_SACK1),
+ SNMP_MIB_ITEM("TCP_SACK2", LINUX_MIB_TCP_SACK2),
+ SNMP_MIB_ITEM("TCP_SACK3", LINUX_MIB_TCP_SACK3),
+ SNMP_MIB_ITEM("TCP_SACK4", LINUX_MIB_TCP_SACK4),
+ SNMP_MIB_ITEM("TCP_WALKEDSKBS", LINUX_MIB_TCP_WALKEDSKBS),
+ SNMP_MIB_ITEM("TCP_WALKEDDSACKS", LINUX_MIB_TCP_WALKEDDSACKS),
+ SNMP_MIB_ITEM("TCP_SKIPPEDSKBS", LINUX_MIB_TCP_SKIPPEDSKBS),
+ SNMP_MIB_ITEM("TCP_NOCACHE", LINUX_MIB_TCP_NOCACHE),
+ SNMP_MIB_ITEM("TCP_FULLWALK", LINUX_MIB_TCP_FULLWALK),
+ SNMP_MIB_ITEM("TCP_HEADWALK", LINUX_MIB_TCP_HEADWALK),
+ SNMP_MIB_ITEM("TCP_TAILWALK", LINUX_MIB_TCP_TAILWALK),
+ SNMP_MIB_ITEM("TCP_FULLSKIP", LINUX_MIB_TCP_FULLSKIP),
+ SNMP_MIB_ITEM("TCP_TAILSKIP", LINUX_MIB_TCP_TAILSKIP),
+ SNMP_MIB_ITEM("TCP_HEADSKIP", LINUX_MIB_TCP_HEADSKIP),
+ SNMP_MIB_ITEM("TCP_HEADSKIP_TOHIGH", LINUX_MIB_TCP_HEADSKIP_TOHIGH),
+ SNMP_MIB_ITEM("TCP_TAIL_TOHIGH", LINUX_MIB_TCP_TAIL_TOHIGH),
+ SNMP_MIB_ITEM("TCP_NEWSKIP", LINUX_MIB_TCP_NEWSKIP),
+ SNMP_MIB_ITEM("TCP_CACHEREMAINING", LINUX_MIB_TCP_CACHEREMAINING),
SNMP_MIB_SENTINEL
};
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5833b01..87ab327 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1370,6 +1370,10 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
*flag |= tcp_sacktag_one(skb, tp, reord, dup_sack, *fack_count);
*fack_count += tcp_skb_pcount(skb);
+
+ NET_INC_STATS_BH(LINUX_MIB_TCP_WALKEDSKBS);
+ if (dup_sack)
+ NET_INC_STATS_BH(LINUX_MIB_TCP_WALKEDDSACKS);
}
return skb;
}
@@ -1386,6 +1390,8 @@ static struct sk_buff *tcp_sacktag_skip(struct sk_buff *skb, struct sock *sk,
if (before(TCP_SKB_CB(skb)->end_seq, skip_to_seq))
break;
+
+ NET_INC_STATS_BH(LINUX_MIB_TCP_SKIPPEDSKBS);
}
return skb;
}
@@ -1434,6 +1440,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
int fack_count;
int i, j;
int first_sack_index;
+ int fullwalk = 1;
if (!tp->sacked_out) {
if (WARN_ON(tp->fackets_out))
@@ -1523,6 +1530,17 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
cache++;
}
+ switch (used_sacks) {
+ case 0: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK0); break;
+ case 1: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK1); break;
+ case 2: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK2); break;
+ case 3: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK3); break;
+ case 4: NET_INC_STATS_BH(LINUX_MIB_TCP_SACK4); break;
+ }
+
+ if (!tcp_sack_cache_ok(tp, cache))
+ NET_INC_STATS_BH(LINUX_MIB_TCP_NOCACHE);
+
while (i < used_sacks) {
u32 start_seq = sp[i].start_seq;
u32 end_seq = sp[i].end_seq;
@@ -1544,6 +1562,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
/* Can skip some work by looking recv_sack_cache? */
if (tcp_sack_cache_ok(tp, cache) && !dup_sack &&
after(end_seq, cache->start_seq)) {
+ int headskip = 0;
+
+ fullwalk = 0;
/* Head todo? */
if (before(start_seq, cache->start_seq)) {
@@ -1551,12 +1572,18 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
skb = tcp_sacktag_walk(skb, sk, next_dup, start_seq,
cache->start_seq, dup_sack,
&fack_count, &reord, &flag);
- }
+ NET_INC_STATS_BH(LINUX_MIB_TCP_HEADWALK);
+ } else
+ headskip = 1;
/* Rest of the block already fully processed? */
if (!after(end_seq, cache->end_seq)) {
skb = tcp_maybe_skipping_dsack(skb, sk, next_dup, cache->end_seq,
&fack_count, &reord, &flag);
+ if (headskip)
+ NET_INC_STATS_BH(LINUX_MIB_TCP_FULLSKIP);
+ else
+ NET_INC_STATS_BH(LINUX_MIB_TCP_TAILSKIP);
goto advance_sp;
}
@@ -1571,24 +1598,37 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
skb = tcp_write_queue_next(sk, tp->highest_sack);
fack_count = tp->fackets_out;
cache++;
+
+ if (headskip)
+ NET_INC_STATS_BH(LINUX_MIB_TCP_HEADSKIP_TOHIGH);
+ else
+ NET_INC_STATS_BH(LINUX_MIB_TCP_TAIL_TOHIGH);
goto walk;
}
skb = tcp_sacktag_skip(skb, sk, cache->end_seq);
/* Check overlap against next cached too (past this one already) */
cache++;
+
+ if (headskip)
+ NET_INC_STATS_BH(LINUX_MIB_TCP_HEADSKIP);
continue;
}
if (!before(start_seq, tcp_highest_sack_seq(tp))) {
skb = tcp_write_queue_next(sk, tp->highest_sack);
fack_count = tp->fackets_out;
+ NET_INC_STATS_BH(LINUX_MIB_TCP_NEWSKIP);
}
skb = tcp_sacktag_skip(skb, sk, start_seq);
walk:
skb = tcp_sacktag_walk(skb, sk, next_dup, start_seq, end_seq,
dup_sack, &fack_count, &reord, &flag);
+ if (fullwalk)
+ NET_INC_STATS_BH(LINUX_MIB_TCP_FULLWALK);
+ else
+ NET_INC_STATS_BH(LINUX_MIB_TCP_TAILWALK);
advance_sp:
/* SACK enhanced FRTO (RFC4138, Appendix B): Clearing correct
@@ -1598,15 +1638,21 @@ advance_sp:
flag &= ~FLAG_ONLY_ORIG_SACKED;
i++;
+ fullwalk = 1;
}
+ if (tcp_sack_cache_ok(tp, cache))
+ NET_INC_STATS_BH(LINUX_MIB_TCP_CACHEREMAINING);
+
/* Clear the head of the cache sack blocks so we can skip it next time */
for (i = 0; i < ARRAY_SIZE(tp->recv_sack_cache) - used_sacks; i++) {
tp->recv_sack_cache[i].start_seq = 0;
tp->recv_sack_cache[i].end_seq = 0;
}
- for (j = 0; j < used_sacks; j++)
+ for (j = 0; j < used_sacks; j++) {
+ WARN_ON(i >= ARRAY_SIZE(tp->recv_sack_cache));
tp->recv_sack_cache[i++] = sp[j];
+ }
flag |= tcp_mark_lost_retrans(sk);
--
1.5.0.6
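With the instrumentation applied, the new counters show up in the
TcpExt section of /proc/net/netstat. A minimal userspace sketch along
these lines (not part of the patch; it only assumes the usual paired
name/value "TcpExt:" lines and that the new counter names all start
with "TCP_") can dump them for a quick look:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char names[4096], values[4096];
	FILE *f = fopen("/proc/net/netstat", "r");

	if (!f) {
		perror("/proc/net/netstat");
		return 1;
	}
	while (fgets(names, sizeof(names), f)) {
		char *nsave, *vsave, *n, *v;

		if (strncmp(names, "TcpExt:", 7) != 0)
			continue;
		if (!fgets(values, sizeof(values), f))
			break;
		n = strtok_r(names + 7, " \n", &nsave);
		v = strtok_r(values + 7, " \n", &vsave);
		while (n && v) {
			/* print only the counters added by this patch */
			if (strncmp(n, "TCP_", 4) == 0)
				printf("%-24s %s\n", n, v);
			n = strtok_r(NULL, " \n", &nsave);
			v = strtok_r(NULL, " \n", &vsave);
		}
		break;
	}
	fclose(f);
	return 0;
}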
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH 01/10] [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s
2007-11-15 13:37 ` [PATCH 01/10] [TCP]: Move !in_sack test earlier in sacktag & reorganize if()s Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially Ilpo Järvinen
@ 2007-11-16 3:33 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:33 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:25 +0200
> All intermediate conditions include it already, make them
> simpler as well.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Applied to net-2.6.25
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially
2007-11-15 13:37 ` [PATCH 02/10] [TCP]: Extend reordering detection to cover CA_Loss partially Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery Ilpo Järvinen
@ 2007-11-16 3:35 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:35 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:26 +0200
> This implements more accurately what is stated in sacktag's
> overall comment:
>
> "Both of these heuristics are not used in Loss state, when
> we cannot account for retransmits accurately."
>
> When CA_Loss state is entered, the state changer ensures that
> undo_marker is only set if no TCPCB_RETRANS skbs were found,
> thus having a non-zero undo_marker in CA_Loss basically tells
> us that the R-bits still accurately reflect the current state
> of TCP.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Applied to net-2.6.25, thanks!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery
2007-11-15 13:37 ` [PATCH 3/10] [TCP]: non-FACK SACK follows conservative SACK loss recovery Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access Ilpo Järvinen
@ 2007-11-16 3:41 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:41 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:27 +0200
> Many assumptions that are true when no reordering or other
> strange events happen are not a part of the RFC3517. FACK
> implementation is based on such assumptions. Previously (before
> the rewrite) the non-FACK SACK was basically doing fast rexmit
> and then timing out all skbs when the first cumulative ACK arrives,
> which cannot really be called SACK-based recovery :-).
>
> RFC3517 SACK disables these things:
> - Per SKB timeouts & head timeout entry to recovery
> - Marking at least one skb while in recovery (RFC3517 does this
> only for the fast retransmission but not for the other skbs
> when cumulative ACKs arrive in the recovery)
> - Sacktag's loss detection flavors B and C (see comment before
> tcp_sacktag_write_queue)
>
> This does not implement the "last resort" rule 3 of NextSeg, which
> allows retransmissions also when not enough SACK blocks have yet
> arrived above a segment for IsLost to return true [RFC3517].
>
> The implementation differs from RFC3517 in these points:
> - Rate-halving is used instead of FlightSize / 2
> - Instead of using dupACKs to trigger the recovery, the number
> of SACK blocks is used as FACK does with SACK blocks+holes
> (which provides a more accurate number). It seems that the
> difference can affect negatively only if the receiver does not
> generate SACK blocks at all even though it claimed to be
> SACK-capable.
> - Dupthresh is not a constant. Dynamic adjustments include
> both holes and sacked segments (equal to what FACK has) due to
> the complexity involved in determining the number of sacked blocks
> between highest_sack and the reordered segment. Thus it will
> be an over-estimate.
>
> Implementation note:
>
> tcp_clean_rtx_queue doesn't need a lost_cnt tweak because the head
> skb at that point cannot be SACKED_ACKED (nor would such a
> situation last long enough to cause problems).
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Thanks a lot for doing this work, these changes look fine to
me.
It occurs to me that the loss engine basically runs in about
2 or 3 modes, and instead of making the same tests multiple
times through the ACK processing paths we might want to move
to some kind of 'tcp_loss_ops' scheme.
It is just an idea.
Patch applied to net-2.6.25, thanks!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access
2007-11-15 13:37 ` [PATCH 04/10] [TCP]: Convert highest_sack to sk_buff to allow direct access Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained Ilpo Järvinen
@ 2007-11-16 3:42 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:42 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:28 +0200
> It is going to replace the sack fastpath hint quite soon... :-)
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
I always liked this patch, how can I not apply it? :-)
Added to net-2.6.25, thanks Ilpo!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained
2007-11-15 13:37 ` [PATCH 05/10] [TCP]: Make lost retrans detection more self-contained Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq Ilpo Järvinen
@ 2007-11-16 3:43 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:43 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:29 +0200
> Highest_sack_end_seq is no longer calculated in the loop,
> thus it can be pushed to the worker function altogether
> making that function independent of the sacktag.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Looks good, applied.
Thanks Ilpo.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq
2007-11-15 13:37 ` [PATCH 06/10] [TCP]: Prior_fackets can be replaced by highest_sack seq Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 07/10] [TCP]: Create tcp_sacktag_one() Ilpo Järvinen
@ 2007-11-16 3:44 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:44 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:30 +0200
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Every simplification like this is like one less potential
bug to fix :-)
Patch applied to net-2.6.25, thanks!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 07/10] [TCP]: Create tcp_sacktag_one().
2007-11-15 13:37 ` [PATCH 07/10] [TCP]: Create tcp_sacktag_one() Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them Ilpo Järvinen
@ 2007-11-16 3:45 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:45 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:31 +0200
> Worker function that implements the main logic of
> the inner-most loop of tcp_sacktag_write_queue().
>
> Idea was originally presented by David S. Miller.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
How wonderful this change did not rot forever :-)
Thanks for keeping it alive, applied to net-2.6.25, thanks!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them
2007-11-15 13:37 ` [PATCH 08/10] [TCP]: Earlier SACK block verification & simplify access to them Ilpo Järvinen
2007-11-15 13:37 ` [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use Ilpo Järvinen
@ 2007-11-16 3:50 ` David Miller
1 sibling, 0 replies; 23+ messages in thread
From: David Miller @ 2007-11-16 3:50 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:32 +0200
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
This is fine, and I can see how it sets things up for the
new "fastpath" code. Applied to net-2.6.25, thanks!
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use
2007-11-15 13:37 ` [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use Ilpo Järvinen
2007-11-15 13:37 ` [PATCH 10/10] [TCP]: Track sacktag (DEVEL PATCH) Ilpo Järvinen
@ 2007-11-16 4:00 ` David Miller
2007-11-16 13:44 ` Ilpo Järvinen
1 sibling, 1 reply; 23+ messages in thread
From: David Miller @ 2007-11-16 4:00 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 15 Nov 2007 15:37:33 +0200
> Key points of this patch are:
>
> - In case new SACK information is advance only type, no skb
> processing below previously discovered highest point is done
> - Optimize cases below highest point too since there's no need
> to always go up to highest point (which is very likely still
> present in that SACK), this is not entirely true though
> because I'm dropping the fastpath_skb_hint which could
> previously optimize those cases even better. Whether that's
> significant, I'm not too sure.
>
> Currently it will provide skipping by walking. Combined with
> RB-tree, all skipping would become fast too regardless of window
> size (can be done incrementally later).
>
> Previously a number of cases in TCP SACK processing fail to
> take advantage of costly stored information in sack_recv_cache,
> most importantly, expected events such as cumulative ACK and new
> hole ACKs. Processing such ACKs results in rather long walks
> building up latencies (which easily gets nasty when window is
> huge). Those latencies are often completely unnecessary
> compared with the amount of _new_ information received, usually
> for cumulative ACK there's no new information at all, yet TCP
> walks the whole queue unnecessarily, potentially taking a number of
> costly cache misses on the way, etc.!
>
> Since the inclusion of highest_sack, there's a lot of information
> that is very likely redundant (SACK fastpath hint stuff,
> fackets_out, highest_sack), though there's no ultimate guarantee
> that they'll remain the same the whole time (in all unearthly
> scenarios). Take advantage of this knowledge here and drop
> fastpath hint and use direct access to highest SACKed skb as
> a replacement.
>
> Effectively "special cased" fastpath is dropped. This change
> adds some complexity to introduce a better-covered "fastpath",
> though the added complexity should make TCP behave more cache
> friendly.
>
> The current ACK's SACK blocks are compared against each cached
> block individually, and only ranges that are new are then scanned
> by the high constant walk. For other parts of write queue, even
> when in previously known part of the SACK blocks, a faster skip
> function is used (if necessary at all). In addition, whenever
> possible, TCP fast-forwards to highest_sack skb that was made
> available by an earlier patch. In typical case, no other things
> but this fast-forward and mandatory markings after that occur
> making the access pattern quite similar to the former fastpath
> "special case".
>
> DSACKs are a special case that must always be walked.
>
> The local to recv_sack_cache copying could be more intelligent
> w.r.t DSACKs which are likely to be there only once but that
> is left to a separate patch.
>
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
I like this patch a lot, and if there are any errors in it
they are in the details not the design, and we have lots of
time to sort that out before the 2.6.25 merge window so
I have applied this.
I have been thinking off and on about the core SACK cost problem.
One thing that keeps occurring to me is that to conquer this fully we
have to change the domain of the problem in our favor.
Basically, even if we have an RB-tree and all of these nice fastpaths, we
still can walk the complete queue for a receiver which maliciously
advertises lots of SACK information in one ACK (say covering every
retransmit queue SKB except one) and then nearly no SACK information
in the next one. Repeating this cycle we can get into excessive usage
of cpu cycles.
Only in situations of severe memory constraints will a receiver
release out-of-order packets and thus remove sequences from previously
advertised SACK blocks (which by RFCs we must handle). And in such
cases it is OK to resort to timeout based recovery.
This simplifies everything and puts a very strict upper bound on
the amount of processing the most aggressive "bad guy" receiver
can impose upon us. One full SKB queue scan per window/RTT.
With all of those other weird cases out of the way, there is only
one event to process: SACKs cover more data.
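As a toy userspace model of that single-event rule (made-up names,
sequence numbers only, no skbs or kernel structures), the sender could
remember the highest sequence already processed and walk only the part
of a new SACK block above it, so re-advertising old blocks costs
nothing:

#include <stdint.h>
#include <stdio.h>

/* wrap-safe "a is after b" sequence comparison */
static int seq_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

struct toy_sender {
	uint32_t highest_sacked;	/* highest end_seq already processed */
};

/* Returns how many sequence numbers actually need walking. */
static uint32_t process_sack_block(struct toy_sender *tp,
				   uint32_t start_seq, uint32_t end_seq)
{
	uint32_t walk_from = start_seq;

	if (!seq_after(end_seq, tp->highest_sacked))
		return 0;			/* nothing new, skip it all */
	if (seq_after(tp->highest_sacked, walk_from))
		walk_from = tp->highest_sacked;	/* skip the known part */
	tp->highest_sacked = end_seq;
	return end_seq - walk_from;
}

int main(void)
{
	struct toy_sender tp = { .highest_sacked = 1000 };

	printf("%u\n", process_sack_block(&tp, 500, 3000));	/* 2000 */
	printf("%u\n", process_sack_block(&tp, 500, 2500));	/* 0 */
	return 0;
}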
Ilpo what do you think about these ideas?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use
2007-11-16 4:00 ` [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use David Miller
@ 2007-11-16 13:44 ` Ilpo Järvinen
2007-11-20 6:08 ` David Miller
0 siblings, 1 reply; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-16 13:44 UTC (permalink / raw)
To: David Miller; +Cc: Netdev
On Thu, 15 Nov 2007, David Miller wrote:
> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Thu, 15 Nov 2007 15:37:33 +0200
>
> > Key points of this patch are:
> >
> > - In case new SACK information is advance only type, no skb
> > processing below previously discovered highest point is done
> > - Optimize cases below highest point too since there's no need
> > to always go up to highest point (which is very likely still
> > present in that SACK), this is not entirely true though
> > because I'm dropping the fastpath_skb_hint which could
> > previously optimize those cases even better. Whether that's
> > significant, I'm not too sure.
> >
> > Currently it will provide skipping by walking. Combined with
> > RB-tree, all skipping would become fast too regardless of window
> > size (can be done incrementally later).
> >
> > Previously a number of cases in TCP SACK processing fail to
> > take advantage of costly stored information in sack_recv_cache,
> > most importantly, expected events such as cumulative ACK and new
> > hole ACKs. Processing such ACKs results in rather long walks
> > building up latencies (which easily gets nasty when window is
> > huge). Those latencies are often completely unnecessary
> > compared with the amount of _new_ information received, usually
> > for cumulative ACK there's no new information at all, yet TCP
> > walks the whole queue unnecessarily, potentially taking a number of
> > costly cache misses on the way, etc.!
> >
> > Since the inclusion of highest_sack, there's a lot of information
> > that is very likely redundant (SACK fastpath hint stuff,
> > fackets_out, highest_sack), though there's no ultimate guarantee
> > that they'll remain the same the whole time (in all unearthly
> > scenarios). Take advantage of this knowledge here and drop
> > fastpath hint and use direct access to highest SACKed skb as
> > a replacement.
> >
> > Effectively "special cased" fastpath is dropped. This change
> > adds some complexity to introduce a better-covered "fastpath",
> > though the added complexity should make TCP behave more cache
> > friendly.
> >
> > The current ACK's SACK blocks are compared against each cached
> > block individually, and only ranges that are new are then scanned
> > by the high constant walk. For other parts of write queue, even
> > when in previously known part of the SACK blocks, a faster skip
> > function is used (if necessary at all). In addition, whenever
> > possible, TCP fast-forwards to highest_sack skb that was made
> > available by an earlier patch. In typical case, no other things
> > but this fast-forward and mandatory markings after that occur
> > making the access pattern quite similar to the former fastpath
> > "special case".
> >
> > DSACKs are a special case that must always be walked.
> >
> > The local to recv_sack_cache copying could be more intelligent
> > w.r.t DSACKs which are likely to be there only once but that
> > is left to a separate patch.
> >
> > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
>
> I like this patch a lot, and if there are any errors in it
> they are in the details not the design, and we have lots of
> time to sort that out before the 2.6.25 merge window so
> I have applied this.
>
> I have been thinking off and on about the core SACK cost problem.
...Me too.
> One thing that keeps occurring to me is that to conquer this fully we
> have to change the domain of the problem in our favor.
>
> Basically, even if we have an RB-tree and all of these nice fastpaths, we
> still can walk the complete queue for a receiver which maliciously
> advertises lots of SACK information in one ACK (say covering every
> retransmit queue SKB except one) and then nearly no SACK information
> in the next one. Repeating this cycle we can get into excessive usage
> of cpu cycles.
I agree. I made this change not to optimize malicious cases but to
remove sub-optimalities that non-malicious flows experience in
_typical cases_; people are already hit by them in legitimate scenarios.
I think that we cannot prevent malicious guys from inventing ways to
circumvent the tweaks and optimizations that may exist for the typical,
legitimate use cases. This patch falls into that category. Currently we
just don't know whether removing the recv_sack_cache is going to cost
too much, that is, whether it should be kept. It can always be removed
separately if cache misses are not that bad.
...It may well turn out that this change wouldn't have been necessary
with the new approach we're now discussing; however, as long as the old
code lives, the approach shouldn't hurt either (except for the
uncertainty related to the fastpath_skb_hint != tp->highest_sack case).
...It's just that we're on quite unsure ground when moving to something
completely new; this is not some funny new protocol which nobody has
used before, and people have high expectations on reliability and on
how it should perform in their favorite, legitimate scenario. :-) Thus
it's IMHO also about having a safe enough fallback available (without a
horrible revert resolution if there's a need to back off).
> Only in situations of severe memory constraints will a receiver
> release out-of-order packets and thus remove sequences from previously
> advertised SACK blocks (which by RFCs we must handle). And in such
> cases it is OK to resort to timeout based recovery.
RFCs basically specify only RTO recovery for that; they don't specify
nor mention any other method (IIRC). Not that Linux' way is in
violation of them, but it is likely rare enough that it may receive the
most insignificant optimization award. Having that RTO break might
actually help a legitimate receiver to sort out its problems more than
hurt it.
Alternatively it could be handled by adding a CA_SackReneging state
which would also retransmit sacked skbs, but there are tradeoffs in that
approach too (it either results in suboptimal behavior by rexmitting
also the not-discarded segments that are reported by SACK after RTO, or
makes the code messier by using the last ACK's SACK blocks to avoid
that). ...It might still be useful for removing the need for costly
rebuilding of the RB-tree or whatever at RTO in such a case though (for
a legitimate receiver this is really a corner case, and a malicious one
can still push the sender to that extra work too easily).
> This simplifies everything and puts a very strict upper bound on
> the amount of processing the most aggressive "bad guy" receiver
> can impose upon us. One full SKB queue scan per window/RTT.
An upper bound we lack, yeah. But I'd say that's somewhat
underestimating the problem because CA_Recovery is going to cost another
"scan", unless somebody is going to combine that with SACK processing as
well (sounds scary). Luckily, due to cache effects, it's likely going to
be a cheaper scan.
> With all of those other weird cases out of the way, there is only
> one event to process: SACKs cover more data.
>
> Ilpo what do you think about these ideas?
Besides the log(n) search, for which patches already exist, only an
O(new information) scan over the start..end range would be necessary.
Or had you something else in mind (I'm not that sure what the "ideas",
in plural, were :-))? ...I haven't forgotten the RB-tree things either
(and I'm well aware of the limited scope of things this patch
addresses).
One idea probably not mentioned earlier: I've thought of (ab)using the
skb's end_seq to lazily point to the end of SACK blocks, but that's
probably going to be a dead end because of the laziness.
A couple of things are still in the way, making this not that
straightforward. DSACKs, which are significant for already-SACKed skbs
if the EVER_RETRANS bit is not set and for S|R skbs, and the
lost_retrans loop, which is interested in S|R skbs too... Have you
thought about how to solve them if the only optimized event were "SACKs
cover more data"?
...Also there's a third problem which may constrain us quite a lot.
Any additional cache misses in the typical, legitimate use case might be
too much w.r.t. the high expectations of the current users. They have
just very few skb accesses currently (and I hopefully made it even
better with this patch ;-)). Adding log(n) cache misses to that per ACK
with SACK may well be significant, though it may be closer to
log(outstanding not acked), which is considerably smaller on average.
Therefore something like what I've now done in this patch might still
be necessary.
...anyway, I'll have to get some work done today as well... :-)
--
i.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use
2007-11-16 13:44 ` Ilpo Järvinen
@ 2007-11-20 6:08 ` David Miller
2007-11-20 9:22 ` Ilpo Järvinen
0 siblings, 1 reply; 23+ messages in thread
From: David Miller @ 2007-11-20 6:08 UTC (permalink / raw)
To: ilpo.jarvinen; +Cc: netdev
From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Fri, 16 Nov 2007 15:44:40 +0200 (EET)
> Besides the log(n) search, for which patches already exist, only an
> O(new information) scan over the start..end range would be necessary. Or had you
> something else in mind (I'm not that sure what the "ideas", in plural,
> were :-))? ...I haven't forgotten the RB-tree things either (and I'm
> well aware of the limited scope of things this patch addresses).
Since you brought it up, can you respin the RB-tree patches? Let's
just put them into net-2.6.25 and let them cook in there for a long
time.
If they prove to be too expensive in the no-loss case, we can do
something like lazy tree creation or simply improve the
insert/delete implementation.
I think it's important to put RB-tree in because a lot of the ideas
we are discussing have different implications if we go one way or
the other so we should commit to something now in order to focus
our efforts better.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [RFC PATCH 09/10] [TCP]: Rewrite SACK block processing & sack_recv_cache use
2007-11-20 6:08 ` David Miller
@ 2007-11-20 9:22 ` Ilpo Järvinen
0 siblings, 0 replies; 23+ messages in thread
From: Ilpo Järvinen @ 2007-11-20 9:22 UTC (permalink / raw)
To: David Miller; +Cc: Netdev
On Mon, 19 Nov 2007, David Miller wrote:
> From: "Ilpo_Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Fri, 16 Nov 2007 15:44:40 +0200 (EET)
>
> > Besides the log(n) search, for which patches already exist, only an
> > O(new information) scan over the start..end range would be necessary. Or had you
> > something else in mind (I'm not that sure what the "ideas", in plural,
> > were :-))? ...I haven't forgotten the RB-tree things either (and I'm
> > well aware of the limited scope of things this patch addresses).
>
> Since you brought it up, can you respin the RB-tree patches?
Ok, I've got the rb-tree directly in my tree. I'll check the fack_count
patches since Tom (IIRC, with a long last name :-)) made some fixes to
them.
I've also already done initial combining of things in the new lost
marker (including the bug fixes & cleanups to the original ones), but
its optimizations are awaiting the solution that is made for sacktag,
which may reduce the need for them a lot.
> Let's just put them into net-2.6.25 and let them cook in there for
> a long time.
>
> If they prove to be too expensive in the no-loss case, we can do
> something like lazy tree creation or simply improve the
> insert/delete implementation.
>
> I think it's important to put RB-tree in because a lot of the ideas
> we are discussing have different implications if we go one way or
> the other so we should commit to something now in order to focus
> our efforts better.
I was thinking of trying a split approach where SACKed and
not-yet-SACKed skbs would reside in different trees. I've also observed
that most of the range scan operations are insensitive to ordering, and
if that's not yet fully implemented, they can deal with the mixed
ordering fine after minor modifications. Therefore having ->next access
available seems an overkill requirement as well (avoiding it would make
dropping the list easier). However, it's a bit complex to implement a
tree walker for a range because the RB-tree is going to live underneath
the walker (at least in clean_rtx_queue and sacktag) :-); if anyone
knows some resources where such an issue is dealt with, it might help;
a pointer would be nice :-). ...Maybe I'll just use a list in the first
step.
But we're still bound to extra walking for a while if a made-up large
DSACK block comes in. I was thinking of putting some sane limit on how
many skbs the DSACK can cover (in the normal case there shouldn't be
more than two; two because of fragmenting caused by the other
direction's SACK blocks, and with MTU probes up to 2 times that seems
valid).
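A rough userspace sketch of such a cap (all names invented here; the
real walker would of course operate on the write queue and its real
sequence arithmetic) could simply stop counting overlapping skbs after
a small fixed limit:

#include <stdio.h>

#define DSACK_WALK_LIMIT 4	/* ~2 expected segments, doubled for MTU probes */

struct toy_skb {
	unsigned int start_seq, end_seq;
	struct toy_skb *next;
};

/* Walk at most DSACK_WALK_LIMIT skbs overlapping [start_seq, end_seq). */
static int dsack_walk(struct toy_skb *skb, unsigned int start_seq,
		      unsigned int end_seq)
{
	int walked = 0;

	for (; skb && walked < DSACK_WALK_LIMIT; skb = skb->next) {
		if (skb->end_seq <= start_seq)
			continue;		/* before the block, skip */
		if (skb->start_seq >= end_seq)
			break;			/* past the block, done */
		walked++;			/* overlapping: would be marked */
	}
	return walked;
}

int main(void)
{
	struct toy_skb q[8];
	int i;

	for (i = 0; i < 8; i++) {
		q[i].start_seq = i * 100;
		q[i].end_seq = (i + 1) * 100;
		q[i].next = (i < 7) ? &q[i + 1] : NULL;
	}
	/* A made-up "DSACK" covering the whole queue still walks only 4. */
	printf("walked %d skbs\n", dsack_walk(&q[0], 0, 800));
	return 0;
}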
For the record, another idea I had (have SACK-lowtree and others trees
here):

For each SACK block
    search_after_or_equal start_seq in SACK-lowtree
    if after end_seq
        search from others tree, move skb to SACK-lowtree
        block done
    endif
    loop
        search_before_or_equal end_seq in SACK-lowtree
        if they were the same skb
            update the skb in SACK-lowtree
            block done
        endif
        move end_seq result skb back to others tree
    end loop
endfor
But the laziness didn't help that much (it would need to know the
R-bits; the L-bits could be derived), and malicious guys could still do
crazy things by telling a no S-bit, S, no S-bit, S, ... pattern for the
whole window first and then SACKing them all in a single ACK => uh-oh,
we would have to do all those lowtree => others moves for 1/2 of the
window.
--
i.
^ permalink raw reply [flat|nested] 23+ messages in thread