* [PATCH 0/n]: 2 is better than 1 - tcp recombining; both with SACK and rexmits @ 2008-11-24 14:21 Ilpo Järvinen 2008-11-24 14:21 ` [PATCH 01/10] tcp: collapse more than two on retransmission Ilpo Järvinen 0 siblings, 1 reply; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:21 UTC (permalink / raw) To: David Miller; +Cc: netdev Skb recombining is finally there for SACK. After fixing some accesses to already freed memory, accounting issues, plenty of state/data corrupters, and some corner cases which I didn't find by hitting them, it finally seems to make enough sense and run flawlessly. But please still read the skb part of the shifting with more care than usual, since I'm on territory that is pretty unknown to me, though I learned a lot while figuring things out. E.g., do I set the checksum fields correctly, are the pre-condition checks missing something important, etc. The current approach is already enough to get most of the benefits of recombining. There's still room for some improvements, but they can be built on top of this machinery once we're sure that it indeed works. E.g., not all SACK patterns are handled as well as they could be, full-shift special-case optimizations, DSACK/SACK for already SACKed skbs w/o fragmenting, ... just to mention some. NOTE: Like I said before, there is a pending problem with tcpdump getting EFAULT for some reason if cloned skbs get handled. Also the retransmission combining is included (comments addressing your concerns are in the original thread). ...and some minor code reorganization. NOTE: there's one obvious MIBs patch for debug-only purposes for those interested enough. -- i. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH 01/10] tcp: collapse more than two on retransmission 2008-11-24 14:21 [PATCH 0/n]: 2 is better than 1 - tcp recombining; both with SACK and rexmits Ilpo Järvinen @ 2008-11-24 14:21 ` Ilpo Järvinen 2008-11-24 14:21 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen 2008-11-25 5:05 ` [PATCH 01/10] tcp: collapse more than two on retransmission David Miller 0 siblings, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:21 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo J�rvinen I always had thought that collapsing up to two at a time was intentional decision to avoid excessive processing if 1 byte sized skbs are to be combined for a full mtu, and consecutive retransmissions would make the size of the retransmittee double each round anyway, but some recent discussion made me to understand that was not the case. Thus make collapse work more and wait less. It would be possible to take advantage of the shifting machinery (added in the later patch) in the case of paged data but that can be implemented on top of this change. tcp_skb_is_last check is now provided by the loop. I tested a bit (ss-after-idle-off, fill 4096x4096B xfer, 10s sleep + 4096 x 1byte writes while dropping them for some a while with netem): . 16774097:16775545(1448) ack 1 win 46 . 16775545:16776993(1448) ack 1 win 46 . ack 16759617 win 2399 P 16776993:16777217(224) ack 1 win 46 . ack 16762513 win 2399 . ack 16765409 win 2399 . ack 16768305 win 2399 . ack 16771201 win 2399 . ack 16774097 win 2399 . ack 16776993 win 2399 . ack 16777217 win 2399 P 16777217:16777257(40) ack 1 win 46 . ack 16777257 win 2399 P 16777257:16778705(1448) ack 1 win 46 P 16778705:16780153(1448) ack 1 win 46 FP 16780153:16781313(1160) ack 1 win 46 . ack 16778705 win 2399 . ack 16780153 win 2399 F 1:1(0) ack 16781314 win 2399 While without drop-all period I get this: . 16773585:16775033(1448) ack 1 win 46 . ack 16764897 win 9367 . ack 16767793 win 9367 . ack 16770689 win 9367 . ack 16773585 win 9367 . 16775033:16776481(1448) ack 1 win 46 P 16776481:16777217(736) ack 1 win 46 . ack 16776481 win 9367 . ack 16777217 win 9367 P 16777217:16777218(1) ack 1 win 46 P 16777218:16777219(1) ack 1 win 46 P 16777219:16777220(1) ack 1 win 46 ... P 16777247:16777248(1) ack 1 win 46 . ack 16777218 win 9367 . ack 16777219 win 9367 ... . ack 16777233 win 9367 . ack 16777248 win 9367 P 16777248:16778696(1448) ack 1 win 46 P 16778696:16780144(1448) ack 1 win 46 FP 16780144:16781313(1169) ack 1 win 46 . ack 16780144 win 9367 F 1:1(0) ack 16781314 win 9367 The window seems to be 30-40 segments, which were successfully combined into: P 16777217:16777257(40) ack 1 win 46 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- net/ipv4/tcp_output.c | 96 ++++++++++++++++++++++++++++++------------------- 1 files changed, 59 insertions(+), 37 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index a524627..86ef989 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1766,46 +1766,22 @@ u32 __tcp_select_window(struct sock *sk) return window; } -/* Attempt to collapse two adjacent SKB's during retransmission. */ -static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, - int mss_now) +/* Collapses two adjacent SKB's during retransmission. 
*/ +static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *next_skb = tcp_write_queue_next(sk, skb); int skb_size, next_skb_size; u16 flags; - /* The first test we must make is that neither of these two - * SKB's are still referenced by someone else. - */ - if (skb_cloned(skb) || skb_cloned(next_skb)) - return; - skb_size = skb->len; next_skb_size = next_skb->len; flags = TCP_SKB_CB(skb)->flags; - /* Also punt if next skb has been SACK'd. */ - if (TCP_SKB_CB(next_skb)->sacked & TCPCB_SACKED_ACKED) - return; - - /* Next skb is out of window. */ - if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp))) - return; - - /* Punt if not enough space exists in the first SKB for - * the data in the second, or the total combined payload - * would exceed the MSS. - */ - if ((next_skb_size > skb_tailroom(skb)) || - ((skb_size + next_skb_size) > mss_now)) - return; - BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1); tcp_highest_sack_combine(sk, next_skb, skb); - /* Ok. We will be able to collapse the packet. */ tcp_unlink_write_queue(next_skb, sk); skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size), @@ -1847,6 +1823,62 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, sk_wmem_free_skb(sk, next_skb); } +static int tcp_can_collapse(struct sock *sk, struct sk_buff *skb) +{ + if (tcp_skb_pcount(skb) > 1) + return 0; + /* TODO: SACK collapsing could be used to remove this condition */ + if (skb_shinfo(skb)->nr_frags != 0) + return 0; + if (skb_cloned(skb)) + return 0; + if (skb == tcp_send_head(sk)) + return 0; + /* Some heurestics for collapsing over SACK'd could be invented */ + if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) + return 0; + + return 1; +} + +static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to, + int space) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb = to, *tmp; + int first = 1; + + if (!sysctl_tcp_retrans_collapse) + return; + if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_SYN) + return; + + tcp_for_write_queue_from_safe(skb, tmp, sk) { + if (!tcp_can_collapse(sk, skb)) + break; + + space -= skb->len; + + if (first) { + first = 0; + continue; + } + + if (space < 0) + break; + /* Punt if not enough space exists in the first SKB for + * the data in the second + */ + if (skb->len > skb_tailroom(to)) + break; + + if (after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp))) + break; + + tcp_collapse_retrans(sk, to); + } +} + /* Do a simple retransmit without using the backoff mechanisms in * tcp_timer. This is used for path mtu discovery. * The socket is already locked here. @@ -1946,17 +1978,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) return -ENOMEM; /* We'll try again later. */ } - /* Collapse two adjacent packets if worthwhile and we can. */ - if (!(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_SYN) && - (skb->len < (cur_mss >> 1)) && - (!tcp_skb_is_last(sk, skb)) && - (tcp_write_queue_next(sk, skb) != tcp_send_head(sk)) && - (skb_shinfo(skb)->nr_frags == 0 && - skb_shinfo(tcp_write_queue_next(sk, skb))->nr_frags == 0) && - (tcp_skb_pcount(skb) == 1 && - tcp_skb_pcount(tcp_write_queue_next(sk, skb)) == 1) && - (sysctl_tcp_retrans_collapse != 0)) - tcp_retrans_try_collapse(sk, skb, cur_mss); + tcp_retrans_try_collapse(sk, skb, cur_mss); /* Some Solaris stacks overoptimize and ignore the FIN on a * retransmit when old data is attached. So strip it off -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
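A standalone userspace sketch of the "space" budget that the reworked tcp_retrans_try_collapse() applies while walking the write queue: followers keep getting folded into the skb being retransmitted as long as the combined payload stays within one MSS. The segment sizes are invented to mimic the 1-byte-write trace quoted in the changelog; this is a toy model of the loop, not kernel code.

#include <stdio.h>

int main(void)
{
	int mss = 1448;			/* the "space" budget == cur_mss */
	int nsegs = 40;			/* forty 1-byte writes, as in the trace */
	int seg_len = 1;
	int space = mss, first = 1, head_len = 0, i;

	for (i = 0; i < nsegs; i++) {
		space -= seg_len;	/* every skb, including the head, eats budget */
		if (first) {		/* the head skb is only the collapse target */
			first = 0;
			head_len = seg_len;
			continue;
		}
		if (space < 0)		/* combined payload would exceed the MSS */
			break;
		head_len += seg_len;	/* stands in for tcp_collapse_retrans() */
	}

	/* Prints 40: all forty 1-byte skbs end up in one retransmission,
	 * matching the combined "P 16777217:16777257(40)" segment above.
	 */
	printf("head skb now carries %d bytes\n", head_len);
	return 0;
}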
* [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-24 14:21 ` [PATCH 01/10] tcp: collapse more than two on retransmission Ilpo Järvinen @ 2008-11-24 14:21 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 03/10] tcp: more aggressive skipping Ilpo Järvinen ` (3 more replies) 2008-11-25 5:05 ` [PATCH 01/10] tcp: collapse more than two on retransmission David Miller 1 sibling, 4 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:21 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo J�rvinen Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- include/net/tcp.h | 2 - net/ipv4/tcp_input.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++++ net/ipv4/tcp_output.c | 50 ------------------------------------------------- 3 files changed, 50 insertions(+), 52 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 8f26b28..90b4c3b 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -472,8 +472,6 @@ extern void tcp_send_delayed_ack(struct sock *sk); /* tcp_input.c */ extern void tcp_cwnd_application_limited(struct sock *sk); -extern void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, - struct sk_buff *skb); /* tcp_timer.c */ extern void tcp_init_xmit_timers(struct sock *); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 097294b..01b7458 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2559,6 +2559,56 @@ static void tcp_mtup_probe_success(struct sock *sk, struct sk_buff *skb) tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); } +/* Do a simple retransmit without using the backoff mechanisms in + * tcp_timer. This is used for path mtu discovery. + * The socket is already locked here. + */ +void tcp_simple_retransmit(struct sock *sk) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; + unsigned int mss = tcp_current_mss(sk, 0); + u32 prior_lost = tp->lost_out; + + tcp_for_write_queue(skb, sk) { + if (skb == tcp_send_head(sk)) + break; + if (skb->len > mss && + !(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) { + if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) { + TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; + tp->retrans_out -= tcp_skb_pcount(skb); + } + tcp_skb_mark_lost_uncond_verify(tp, skb); + } + } + + tcp_clear_retrans_hints_partial(tp); + + if (prior_lost == tp->lost_out) + return; + + if (tcp_is_reno(tp)) + tcp_limit_reno_sacked(tp); + + tcp_verify_left_out(tp); + + /* Don't muck with the congestion window here. + * Reason is that we do not increase amount of _data_ + * in network, but units changed and effective + * cwnd/ssthresh really reduced now. + */ + if (icsk->icsk_ca_state != TCP_CA_Loss) { + tp->high_seq = tp->snd_nxt; + tp->snd_ssthresh = tcp_current_ssthresh(sk); + tp->prior_ssthresh = 0; + tp->undo_marker = 0; + tcp_set_ca_state(sk, TCP_CA_Loss); + } + tcp_xmit_retransmit_queue(sk); +} + /* Process an event, which can update packets-in-flight not trivially. * Main goal of this function is to calculate new estimate for left_out, * taking into account both packets sitting in receiver's buffer and diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 86ef989..c069ecb 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1879,56 +1879,6 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to, } } -/* Do a simple retransmit without using the backoff mechanisms in - * tcp_timer. This is used for path mtu discovery. - * The socket is already locked here. 
- */ -void tcp_simple_retransmit(struct sock *sk) -{ - const struct inet_connection_sock *icsk = inet_csk(sk); - struct tcp_sock *tp = tcp_sk(sk); - struct sk_buff *skb; - unsigned int mss = tcp_current_mss(sk, 0); - u32 prior_lost = tp->lost_out; - - tcp_for_write_queue(skb, sk) { - if (skb == tcp_send_head(sk)) - break; - if (skb->len > mss && - !(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) { - if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; - tp->retrans_out -= tcp_skb_pcount(skb); - } - tcp_skb_mark_lost_uncond_verify(tp, skb); - } - } - - tcp_clear_retrans_hints_partial(tp); - - if (prior_lost == tp->lost_out) - return; - - if (tcp_is_reno(tp)) - tcp_limit_reno_sacked(tp); - - tcp_verify_left_out(tp); - - /* Don't muck with the congestion window here. - * Reason is that we do not increase amount of _data_ - * in network, but units changed and effective - * cwnd/ssthresh really reduced now. - */ - if (icsk->icsk_ca_state != TCP_CA_Loss) { - tp->high_seq = tp->snd_nxt; - tp->snd_ssthresh = tcp_current_ssthresh(sk); - tp->prior_ssthresh = 0; - tp->undo_marker = 0; - tcp_set_ca_state(sk, TCP_CA_Loss); - } - tcp_xmit_retransmit_queue(sk); -} - /* This retransmits one SKB. Policy decisions and retransmit queue * state updates are done by the caller. Returns non-zero if an * error occurred which prevented the send. -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 03/10] tcp: more aggressive skipping 2008-11-24 14:21 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries Ilpo Järvinen 2008-11-25 5:12 ` [PATCH 03/10] tcp: more aggressive skipping David Miller 2008-11-24 14:50 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen ` (2 subsequent siblings) 3 siblings, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo Järvinen I already knew when rewriting the sacktag code that this condition was too conservative; change it now, since it prevents a lot of useless work (especially in the SACK shifter decision code that is being added by a later patch). This shouldn't change anything really, it just saves some processing regardless of the shifter. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- net/ipv4/tcp_input.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 01b7458..84ca4f7 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1401,7 +1401,7 @@ static struct sk_buff *tcp_sacktag_skip(struct sk_buff *skb, struct sock *sk, if (skb == tcp_send_head(sk)) break; - if (!before(TCP_SKB_CB(skb)->end_seq, skip_to_seq)) + if (after(TCP_SKB_CB(skb)->end_seq, skip_to_seq)) break; *fack_count += tcp_skb_pcount(skb); -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
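To see why the new condition skips more aggressively, here is a minimal standalone sketch using the before()/after() sequence-number helpers as defined in include/net/tcp.h: an skb whose end_seq equals skip_to_seq lies entirely below the area of interest, so the old !before() test stopped the skip loop one skb too early while the new after() test keeps skipping it. The values are invented for illustration.

#include <stdio.h>

typedef unsigned int u32;
typedef int s32;

/* Same definitions as the kernel's before()/after() helpers */
static int before(u32 seq1, u32 seq2) { return (s32)(seq1 - seq2) < 0; }
static int after(u32 seq1, u32 seq2)  { return before(seq2, seq1); }

int main(void)
{
	u32 skip_to_seq = 1000;
	u32 end_seq = 1000;	/* skb ends exactly at the skip target */

	/* Old test: stop skipping (prints 1) -- this skb got handed to the
	 * slower per-skb processing even though nothing in it is of interest.
	 */
	printf("old: stop skipping = %d\n", !before(end_seq, skip_to_seq));

	/* New test: keep skipping (prints 0) -- the skb is passed over cheaply. */
	printf("new: stop skipping = %d\n", after(end_seq, skip_to_seq));
	return 0;
}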
* [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries 2008-11-24 14:22 ` [PATCH 03/10] tcp: more aggressive skipping Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too Ilpo Järvinen 2008-11-25 5:13 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries David Miller 2008-11-25 5:12 ` [PATCH 03/10] tcp: more aggressive skipping David Miller 1 sibling, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo J�rvinen Sadly enough, this adds possible divide though we try to avoid it by checking one mss as common case. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- net/ipv4/tcp_input.c | 27 +++++++++++++++++++++++---- 1 files changed, 23 insertions(+), 4 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 84ca4f7..5e1e121 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1247,20 +1247,39 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, { int in_sack, err; unsigned int pkt_len; + unsigned int mss; in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) && !before(end_seq, TCP_SKB_CB(skb)->end_seq); if (tcp_skb_pcount(skb) > 1 && !in_sack && after(TCP_SKB_CB(skb)->end_seq, start_seq)) { - + mss = tcp_skb_mss(skb); in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq); - if (!in_sack) + if (!in_sack) { pkt_len = start_seq - TCP_SKB_CB(skb)->seq; - else + if (pkt_len < mss) + pkt_len = mss; + } else { pkt_len = end_seq - TCP_SKB_CB(skb)->seq; - err = tcp_fragment(sk, skb, pkt_len, skb_shinfo(skb)->gso_size); + if (pkt_len < mss) + return -EINVAL; + } + + /* Round if necessary so that SACKs cover only full MSSes + * and/or the remaining small portion (if present) + */ + if (pkt_len > mss) { + unsigned int new_len = (pkt_len / mss) * mss; + if (!in_sack && new_len < pkt_len) { + new_len += mss; + if (new_len > skb->len) + return 0; + } + pkt_len = new_len; + } + err = tcp_fragment(sk, skb, pkt_len, mss); if (err < 0) return err; } -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
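A worked example of the rounding block above, as a standalone sketch (the numbers are invented; the real code additionally handles the pkt_len < mss cases by bumping to one MSS or bailing out, and re-checks against skb->len after rounding up): with mss = 1448 and a split point 3000 bytes into the skb, a SACK block that starts there rounds the split up to 4344 so the SACKed tail begins on an MSS boundary, while a SACK block that ends there rounds down to 2896 so only whole MSSes get marked SACKed.

#include <stdio.h>

/* Simplified mirror of the rounding logic in tcp_match_skb_to_sack() */
static unsigned int round_split(unsigned int pkt_len, unsigned int mss,
				int in_sack)
{
	if (pkt_len > mss) {
		unsigned int new_len = (pkt_len / mss) * mss;	/* round down */

		if (!in_sack && new_len < pkt_len)
			new_len += mss;	/* round up: the SACKed tail then
					 * starts on an MSS boundary */
		pkt_len = new_len;
	}
	return pkt_len;
}

int main(void)
{
	unsigned int mss = 1448;

	/* SACK block starts 3000 bytes into the skb: split at 4344 */
	printf("start inside skb -> split at %u\n", round_split(3000, mss, 0));

	/* SACK block ends 3000 bytes into the skb: split at 2896 */
	printf("end inside skb   -> split at %u\n", round_split(3000, mss, 1));
	return 0;
}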
* [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too 2008-11-24 14:22 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing Ilpo Järvinen 2008-11-25 5:15 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too David Miller 2008-11-25 5:13 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries David Miller 1 sibling, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo J�rvinen This is preparatory work for SACK combiner patch which may have to count TCP state changes for only a part of the skb because it will intentionally avoids splitting skb to SACKed and not sacked parts. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- net/ipv4/tcp_input.c | 32 +++++++++++++++++--------------- 1 files changed, 17 insertions(+), 15 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 5e1e121..94533e4 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1288,7 +1288,8 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, } static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, - int *reord, int dup_sack, int fack_count) + int *reord, int dup_sack, int fack_count, + u8 *sackedto, int pcount) { struct tcp_sock *tp = tcp_sk(sk); u8 sacked = TCP_SKB_CB(skb)->sacked; @@ -1313,10 +1314,9 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, * that retransmission is still in flight. */ if (sacked & TCPCB_LOST) { - TCP_SKB_CB(skb)->sacked &= - ~(TCPCB_LOST|TCPCB_SACKED_RETRANS); - tp->lost_out -= tcp_skb_pcount(skb); - tp->retrans_out -= tcp_skb_pcount(skb); + *sackedto &= ~(TCPCB_LOST|TCPCB_SACKED_RETRANS); + tp->lost_out -= pcount; + tp->retrans_out -= pcount; } } else { if (!(sacked & TCPCB_RETRANS)) { @@ -1333,22 +1333,22 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, } if (sacked & TCPCB_LOST) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_LOST; - tp->lost_out -= tcp_skb_pcount(skb); + *sackedto &= ~TCPCB_LOST; + tp->lost_out -= pcount; } } - TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED; + *sackedto |= TCPCB_SACKED_ACKED; flag |= FLAG_DATA_SACKED; - tp->sacked_out += tcp_skb_pcount(skb); + tp->sacked_out += pcount; - fack_count += tcp_skb_pcount(skb); + fack_count += pcount; /* Lost marker hint past SACKed? Tweak RFC3517 cnt */ if (!tcp_is_fack(tp) && (tp->lost_skb_hint != NULL) && before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(tp->lost_skb_hint)->seq)) - tp->lost_cnt_hint += tcp_skb_pcount(skb); + tp->lost_cnt_hint += pcount; if (fack_count > tp->fackets_out) tp->fackets_out = fack_count; @@ -1361,9 +1361,9 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, * frames and clear it. undo_retrans is decreased above, L|R frames * are accounted above as well. 
*/ - if (dup_sack && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; - tp->retrans_out -= tcp_skb_pcount(skb); + if (dup_sack && (*sackedto & TCPCB_SACKED_RETRANS)) { + *sackedto &= ~TCPCB_SACKED_RETRANS; + tp->retrans_out -= pcount; } return flag; @@ -1403,7 +1403,9 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, if (in_sack) *flag |= tcp_sacktag_one(skb, sk, reord, dup_sack, - *fack_count); + *fack_count, + &(TCP_SKB_CB(skb)->sacked), + tcp_skb_pcount(skb)); *fack_count += tcp_skb_pcount(skb); } -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing 2008-11-24 14:22 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 07/10] tcp: Make shifting not clear the hints Ilpo Järvinen 2008-11-25 5:20 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing David Miller 2008-11-25 5:15 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too David Miller 1 sibling, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo J�rvinen During SACK processing, most of the benefits of TSO are eaten by the SACK blocks that one-by-one fragment SKBs to MSS sized chunks. Then we're in problems when cleanup work for them has to be done when a large cumulative ACK comes. Try to return back to pre-split state already while more and more SACK info gets discovered by combining newly discovered SACK areas with the previous skb if that's SACKed as well. This approach has a number of benefits: 1) The processing overhead is spread more equally over the RTT 2) Write queue has less skbs to process (affect everything which has to walk in the queue past the sacked areas) 3) Write queue is consistent whole the time, so no other parts of TCP has to be aware of this (this was not the case with some other approach that was, well, quite intrusive all around). 4) Clean_rtx_queue can release most of the pages using single put_page instead of previous PAGE_SIZE/mss+1 calls In case a hole is fully filled by the new SACK block, we attempt to combine the next skb too which allows construction of skbs that are even larger than what tso split them to and it handles hole per on every nth patterns that often occur during slow start overshoot pretty nicely. Though this to be really useful also a retransmission would have to get lost since cumulative ACKs advance one hole at a time in the most typical case. TODO: handle upwards only merging. That should be rather easy when segment is fully sacked but I'm leaving that as future work item (it won't make very large difference anyway since this current approach already covers quite a lot of normal cases). I was earlier thinking of some sophisticated way of tracking timestamps of the first and the last segment but later on realized that it won't be that necessary at all to store the timestamp of the last segment. The cases that can occur are basically either: 1) ambiguous => no sensible measurement can be taken anyway 2) non-ambiguous is due to reordering => having the timestamp of the last segment there is just skewing things more off than does some good since the ack got triggered by one of the holes (besides some substle issues that would make determining right hole/skb even harder problem). Anyway, it has nothing to do with this change then. I choose to route some abnormal looking cases with goto noop, some could be handled differently (eg., by stopping the walking at that skb but again). In general, they either shouldn't happen at all or are rare enough to make no difference in practice. In theory this change (as whole) could cause some macroscale regression (global) because of cache misses that are taken over the round-trip time but it gets very likely better because of much less (local) cache misses per other write queue walkers and the big recovery clearing cumulative ack. 
Worth to note that these benefits would be very easy to get also without TSO/GSO being on as long as the data is in pages so that we can merge them. Currently I won't let that happen because DSACK splitting at fragment that would mess up pcounts due to sk_can_gso in tcp_set_skb_tso_segs. Once DSACKs fragments gets avoided, we have some conditions that can be made less strict. TODO: I will probably have to convert the excessive pointer passing to struct sacktag_state... :-) My testing revealed that considerable amount of skbs couldn't be shifted because they were cloned (most likely still awaiting tx reclaim)... [The rest is considering future work instead since I got repeatably EFAULT to tcpdump's recvfrom when I added pskb_expand_head to deal with clones, so I separated that into another, later patch] ...To counter that, I gave up on the fifth advantage: 5) When growing previous SACK block, less allocs for new skbs are done, basically a new alloc is needed only when new hole is detected and when the previous skb runs out of frags space ...which now only happens of if reclaim is fast enough to dispose the clone before the SACK block comes in (the window is RTT long), otherwise we'll have to alloc some. With clones being handled I got these numbers (will be somewhat worse without that), taken with fine-grained mibs: TCPSackShifted 398 TCPSackMerged 877 TCPSackShiftFallback 320 TCPSACKCOLLAPSEFALLBACKGSO 0 TCPSACKCOLLAPSEFALLBACKSKBBITS 0 TCPSACKCOLLAPSEFALLBACKSKBDATA 0 TCPSACKCOLLAPSEFALLBACKBELOW 0 TCPSACKCOLLAPSEFALLBACKFIRST 1 TCPSACKCOLLAPSEFALLBACKPREVBITS 318 TCPSACKCOLLAPSEFALLBACKMSS 1 TCPSACKCOLLAPSEFALLBACKNOHEAD 0 TCPSACKCOLLAPSEFALLBACKSHIFT 0 TCPSACKCOLLAPSENOOPSEQ 0 TCPSACKCOLLAPSENOOPSMALLPCOUNT 0 TCPSACKCOLLAPSENOOPSMALLLEN 0 TCPSACKCOLLAPSEHOLE 12 Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- include/linux/skbuff.h | 33 ++++++ include/net/tcp.h | 5 + net/core/skbuff.c | 140 ++++++++++++++++++++++++++ net/ipv4/tcp_input.c | 256 ++++++++++++++++++++++++++++++++++++++++++++++-- 4 files changed, 427 insertions(+), 7 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a01b6f8..acf17af 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -493,6 +493,19 @@ static inline bool skb_queue_is_last(const struct sk_buff_head *list, } /** + * skb_queue_is_first - check if skb is the first entry in the queue + * @list: queue head + * @skb: buffer + * + * Returns true if @skb is the first buffer on the list. + */ +static inline bool skb_queue_is_first(const struct sk_buff_head *list, + const struct sk_buff *skb) +{ + return (skb->prev == (struct sk_buff *) list); +} + +/** * skb_queue_next - return the next packet in the queue * @list: queue head * @skb: current buffer @@ -511,6 +524,24 @@ static inline struct sk_buff *skb_queue_next(const struct sk_buff_head *list, } /** + * skb_queue_prev - return the prev packet in the queue + * @list: queue head + * @skb: current buffer + * + * Return the prev packet in @list before @skb. It is only valid to + * call this if skb_queue_is_first() evaluates to false. + */ +static inline struct sk_buff *skb_queue_prev(const struct sk_buff_head *list, + const struct sk_buff *skb) +{ + /* This BUG_ON may seem severe, but if we just return then we + * are going to dereference garbage. 
+ */ + BUG_ON(skb_queue_is_first(list, skb)); + return skb->prev; +} + +/** * skb_get - reference buffer * @skb: buffer to reference * @@ -1652,6 +1683,8 @@ extern int skb_splice_bits(struct sk_buff *skb, extern void skb_copy_and_csum_dev(const struct sk_buff *skb, u8 *to); extern void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len); +extern int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, + int shiftlen); extern struct sk_buff *skb_segment(struct sk_buff *skb, int features); diff --git a/include/net/tcp.h b/include/net/tcp.h index 90b4c3b..2653924 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1192,6 +1192,11 @@ static inline struct sk_buff *tcp_write_queue_next(struct sock *sk, struct sk_bu return skb_queue_next(&sk->sk_write_queue, skb); } +static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_buff *skb) +{ + return skb_queue_prev(&sk->sk_write_queue, skb); +} + #define tcp_for_write_queue(skb, sk) \ skb_queue_walk(&(sk)->sk_write_queue, skb) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 267185a..844b8ab 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2018,6 +2018,146 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len) skb_split_no_header(skb, skb1, len, pos); } +/* Shifting from/to a cloned skb is a no-go. + * + * TODO: handle cloned skbs by using pskb_expand_head() + */ +static int skb_prepare_for_shift(struct sk_buff *skb) +{ + return skb_cloned(skb); +} + +/** + * skb_shift - Shifts paged data partially from skb to another + * @tgt: buffer into which tail data gets added + * @skb: buffer from which the paged data comes from + * @shiftlen: shift up to this many bytes + * + * Attempts to shift up to shiftlen worth of bytes, which may be less than + * the length of the skb, from tgt to skb. Returns number bytes shifted. + * It's up to caller to free skb if everything was shifted. + * + * If @tgt runs out of frags, the whole operation is aborted. + * + * Skb cannot include anything else but paged data while tgt is allowed + * to have non-paged data as well. + * + * TODO: full sized shift could be optimized but that would need + * specialized skb free'er to handle frags without up-to-date nr_frags. 
+ */ +int skb_shift(struct sk_buff *tgt, struct sk_buff *skb, int shiftlen) +{ + int from, to, merge, todo; + struct skb_frag_struct *fragfrom, *fragto; + + BUG_ON(shiftlen > skb->len); + BUG_ON(skb_headlen(skb)); /* Would corrupt stream */ + + todo = shiftlen; + from = 0; + to = skb_shinfo(tgt)->nr_frags; + fragfrom = &skb_shinfo(skb)->frags[from]; + + /* Actual merge is delayed until the point when we know we can + * commit all, so that we don't have to undo partial changes + */ + if (!to || + !skb_can_coalesce(tgt, to, fragfrom->page, fragfrom->page_offset)) { + merge = -1; + } else { + merge = to - 1; + + todo -= fragfrom->size; + if (todo < 0) { + if (skb_prepare_for_shift(skb) || + skb_prepare_for_shift(tgt)) + return 0; + + fragto = &skb_shinfo(tgt)->frags[merge]; + + fragto->size += shiftlen; + fragfrom->size -= shiftlen; + fragfrom->page_offset += shiftlen; + + goto onlymerged; + } + + from++; + } + + /* Skip full, not-fitting skb to avoid expensive operations */ + if ((shiftlen == skb->len) && + (skb_shinfo(skb)->nr_frags - from) > (MAX_SKB_FRAGS - to)) + return 0; + + if (skb_prepare_for_shift(skb) || skb_prepare_for_shift(tgt)) + return 0; + + while ((todo > 0) && (from < skb_shinfo(skb)->nr_frags)) { + if (to == MAX_SKB_FRAGS) + return 0; + + fragfrom = &skb_shinfo(skb)->frags[from]; + fragto = &skb_shinfo(tgt)->frags[to]; + + if (todo >= fragfrom->size) { + *fragto = *fragfrom; + todo -= fragfrom->size; + from++; + to++; + + } else { + get_page(fragfrom->page); + fragto->page = fragfrom->page; + fragto->page_offset = fragfrom->page_offset; + fragto->size = todo; + + fragfrom->page_offset += todo; + fragfrom->size -= todo; + todo = 0; + + to++; + break; + } + } + + /* Ready to "commit" this state change to tgt */ + skb_shinfo(tgt)->nr_frags = to; + + if (merge >= 0) { + fragfrom = &skb_shinfo(skb)->frags[0]; + fragto = &skb_shinfo(tgt)->frags[merge]; + + fragto->size += fragfrom->size; + put_page(fragfrom->page); + } + + /* Reposition in the original skb */ + to = 0; + while (from < skb_shinfo(skb)->nr_frags) + skb_shinfo(skb)->frags[to++] = skb_shinfo(skb)->frags[from++]; + skb_shinfo(skb)->nr_frags = to; + + BUG_ON(todo > 0 && !skb_shinfo(skb)->nr_frags); + +onlymerged: + /* Most likely the tgt won't ever need its checksum anymore, skb on + * the other hand might need it if it needs to be resent + */ + tgt->ip_summed = CHECKSUM_PARTIAL; + skb->ip_summed = CHECKSUM_PARTIAL; + + /* Yak, is it really working this way? Some helper please? */ + skb->len -= shiftlen; + skb->data_len -= shiftlen; + skb->truesize -= shiftlen; + tgt->len += shiftlen; + tgt->data_len += shiftlen; + tgt->truesize += shiftlen; + + return shiftlen; +} + /** * skb_prepare_seq_read - Prepare a sequential read of skb data * @skb: the buffer to read diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 94533e4..8106421 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1241,6 +1241,8 @@ static int tcp_check_dsack(struct sock *sk, struct sk_buff *ack_skb, * aligned portion of it that matches. Therefore we might need to fragment * which may fail and creates some hassle (caller must handle error case * returns). 
+ * + * FIXME: this could be merged to shift decision code */ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb, u32 start_seq, u32 end_seq) @@ -1352,9 +1354,6 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, if (fack_count > tp->fackets_out) tp->fackets_out = fack_count; - - if (!before(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp))) - tcp_advance_highest_sack(sk, skb); } /* D-SACK. We can detect redundant retransmission in S|R and plain R @@ -1369,12 +1368,231 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk, return flag; } +static int tcp_shifted_skb(struct sock *sk, struct sk_buff *prev, + struct sk_buff *skb, unsigned int pcount, + int shifted, int fack_count, int *reord, + int *flag, int mss) +{ + struct tcp_sock *tp = tcp_sk(sk); + u8 dummy_sacked = TCP_SKB_CB(skb)->sacked; /* We discard results */ + + BUG_ON(!pcount); + + TCP_SKB_CB(prev)->end_seq += shifted; + TCP_SKB_CB(skb)->seq += shifted; + + skb_shinfo(prev)->gso_segs += pcount; + BUG_ON(skb_shinfo(skb)->gso_segs < pcount); + skb_shinfo(skb)->gso_segs -= pcount; + + /* When we're adding to gso_segs == 1, gso_size will be zero, + * in theory this shouldn't be necessary but as long as DSACK + * code can come after this skb later on it's better to keep + * setting gso_size to something. + */ + if (!skb_shinfo(prev)->gso_size) { + skb_shinfo(prev)->gso_size = mss; + skb_shinfo(prev)->gso_type = sk->sk_gso_type; + } + + /* CHECKME: To clear or not to clear? Mimics normal skb currently */ + if (skb_shinfo(skb)->gso_segs <= 1) { + skb_shinfo(skb)->gso_size = 0; + skb_shinfo(skb)->gso_type = 0; + } + + *flag |= tcp_sacktag_one(skb, sk, reord, 0, fack_count, &dummy_sacked, + pcount); + + /* Difference in this won't matter, both ACKed by the same cumul. ACK */ + TCP_SKB_CB(prev)->sacked |= (TCP_SKB_CB(skb)->sacked & TCPCB_EVER_RETRANS); + + tcp_clear_all_retrans_hints(tp); + + if (skb->len > 0) { + BUG_ON(!tcp_skb_pcount(skb)); + return 0; + } + + /* Whole SKB was eaten :-) */ + + TCP_SKB_CB(skb)->flags |= TCP_SKB_CB(prev)->flags; + if (skb == tcp_highest_sack(sk)) + tcp_advance_highest_sack(sk, skb); + + tcp_unlink_write_queue(skb, sk); + sk_wmem_free_skb(sk, skb); + + return 1; +} + +/* I wish gso_size would have a bit more sane initialization than + * something-or-zero which complicates things + */ +static int tcp_shift_mss(struct sk_buff *skb) +{ + int mss = tcp_skb_mss(skb); + + if (!mss) + mss = skb->len; + + return mss; +} + +/* Shifting pages past head area doesn't work */ +static int skb_can_shift(struct sk_buff *skb) +{ + return !skb_headlen(skb) && skb_is_nonlinear(skb); +} + +/* Try collapsing SACK blocks spanning across multiple skbs to a single + * skb. + */ +static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb, + u32 start_seq, u32 end_seq, + int dup_sack, int *fack_count, + int *reord, int *flag) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *prev; + int mss; + int pcount = 0; + int len; + int in_sack; + + if (!sk_can_gso(sk)) + goto fallback; + + /* Normally R but no L won't result in plain S */ + if (!dup_sack && + (TCP_SKB_CB(skb)->sacked & TCPCB_TAGBITS) == TCPCB_SACKED_RETRANS) + goto fallback; + if (!skb_can_shift(skb)) + goto fallback; + /* This frame is about to be dropped (was ACKed). 
*/ + if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) + goto fallback; + + /* Can only happen with delayed DSACK + discard craziness */ + if (unlikely(skb == tcp_write_queue_head(sk))) + goto fallback; + prev = tcp_write_queue_prev(sk, skb); + + if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED) + goto fallback; + + in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) && + !before(end_seq, TCP_SKB_CB(skb)->end_seq); + + if (in_sack) { + len = skb->len; + pcount = tcp_skb_pcount(skb); + mss = tcp_shift_mss(skb); + + /* TODO: Fix DSACKs to not fragment already SACKed and we can + * drop this restriction as unnecessary + */ + if (mss != tcp_shift_mss(prev)) + goto fallback; + } else { + if (!after(TCP_SKB_CB(skb)->end_seq, start_seq)) + goto noop; + /* CHECKME: This is non-MSS split case only?, this will + * cause skipped skbs due to advancing loop btw, original + * has that feature too + */ + if (tcp_skb_pcount(skb) <= 1) + goto noop; + + in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq); + if (!in_sack) { + /* TODO: head merge to next could be attempted here + * if (!after(TCP_SKB_CB(skb)->end_seq, end_seq)), + * though it might not be worth of the additional hassle + * + * ...we can probably just fallback to what was done + * previously. We could try merging non-SACKed ones + * as well but it probably isn't going to buy off + * because later SACKs might again split them, and + * it would make skb timestamp tracking considerably + * harder problem. + */ + goto fallback; + } + + len = end_seq - TCP_SKB_CB(skb)->seq; + BUG_ON(len < 0); + BUG_ON(len > skb->len); + + /* MSS boundaries should be honoured or else pcount will + * severely break even though it makes things bit trickier. + * Optimize common case to avoid most of the divides + */ + mss = tcp_skb_mss(skb); + + /* TODO: Fix DSACKs to not fragment already SACKed and we can + * drop this restriction as unnecessary + */ + if (mss != tcp_shift_mss(prev)) + goto fallback; + + if (len == mss) { + pcount = 1; + } else if (len < mss) { + goto noop; + } else { + pcount = len / mss; + len = pcount * mss; + } + } + + if (!skb_shift(prev, skb, len)) + goto fallback; + if (!tcp_shifted_skb(sk, prev, skb, pcount, len, *fack_count, reord, + flag, mss)) + goto out; + + /* Hole filled allows collapsing with the next as well, this is very + * useful when hole on every nth skb pattern happens + */ + if (prev == tcp_write_queue_tail(sk)) + goto out; + skb = tcp_write_queue_next(sk, prev); + + if (!skb_can_shift(skb)) + goto out; + if (skb == tcp_send_head(sk)) + goto out; + if ((TCP_SKB_CB(skb)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED) + goto out; + + len = skb->len; + if (skb_shift(prev, skb, len)) { + pcount += tcp_skb_pcount(skb); + tcp_shifted_skb(sk, prev, skb, tcp_skb_pcount(skb), len, + *fack_count, reord, flag, mss); + } + +out: + *fack_count += pcount; + return prev; + +noop: + return skb; + +fallback: + return NULL; +} + static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, struct tcp_sack_block *next_dup, u32 start_seq, u32 end_seq, int dup_sack_in, int *fack_count, int *reord, int *flag) { + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *tmp; + tcp_for_write_queue_from(skb, sk) { int in_sack = 0; int dup_sack = dup_sack_in; @@ -1395,18 +1613,42 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk, dup_sack = 1; } - if (in_sack <= 0) - in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, - end_seq); + /* skb reference here is a bit tricky to get right, since + * 
shifting can eat and free both this skb and the next, + * so not even _safe variant of the loop is enough. + */ + if (in_sack <= 0) { + tmp = tcp_shift_skb_data(sk, skb, start_seq, + end_seq, dup_sack, + fack_count, reord, flag); + if (tmp != NULL) { + if (tmp != skb) { + skb = tmp; + continue; + } + + in_sack = 0; + } else { + in_sack = tcp_match_skb_to_sack(sk, skb, + start_seq, + end_seq); + } + } + if (unlikely(in_sack < 0)) break; - if (in_sack) + if (in_sack) { *flag |= tcp_sacktag_one(skb, sk, reord, dup_sack, *fack_count, &(TCP_SKB_CB(skb)->sacked), tcp_skb_pcount(skb)); + if (!before(TCP_SKB_CB(skb)->seq, + tcp_highest_sack_seq(tp))) + tcp_advance_highest_sack(sk, skb); + } + *fack_count += tcp_skb_pcount(skb); } return skb; -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
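A toy model of the fragment bookkeeping that skb_shift() performs, using plain arrays instead of struct sk_buff: paged data moves from the head of the donor skb to the tail of tgt, coalescing with tgt's last fragment when the pages line up and splitting the last moved fragment when shiftlen ends inside it. Page numbers and sizes are invented, and unlike the real function this version applies changes as it goes rather than first checking that the whole shift can be committed.

#include <stdio.h>

#define MAX_FRAGS 8

struct frag { int page, off, len; };

static int can_coalesce(const struct frag *last, const struct frag *first)
{
	return last->page == first->page && last->off + last->len == first->off;
}

static int shift(struct frag *tgt, int *tgt_n, struct frag *skb, int *skb_n,
		 int shiftlen)
{
	int todo = shiftlen, from = 0, to = *tgt_n, i, n;

	while (todo > 0 && from < *skb_n) {
		struct frag *f = &skb[from];
		int take = todo < f->len ? todo : f->len;

		if (to > 0 && can_coalesce(&tgt[to - 1], f)) {
			tgt[to - 1].len += take;	/* grow tgt's last frag */
		} else {
			if (to == MAX_FRAGS)
				break;			/* tgt is out of frag slots */
			tgt[to].page = f->page;		/* move (part of) the frag */
			tgt[to].off = f->off;
			tgt[to].len = take;
			to++;
		}
		f->off += take;				/* donor keeps the remainder */
		f->len -= take;
		todo -= take;
		if (f->len == 0)
			from++;				/* frag fully consumed */
	}

	/* Compact the donor: drop consumed frags, keep any partial remainder */
	for (i = from, n = 0; i < *skb_n; i++)
		skb[n++] = skb[i];
	*skb_n = n;
	*tgt_n = to;

	return shiftlen - todo;				/* bytes actually shifted */
}

int main(void)
{
	/* tgt already holds one MSS-sized chunk; the donor holds a chunk that
	 * continues the same page plus another chunk on a different page.
	 */
	struct frag tgt[MAX_FRAGS] = { { 1, 0, 1448 } };
	struct frag skb[MAX_FRAGS] = { { 1, 1448, 1448 }, { 2, 0, 1448 } };
	int tgt_n = 1, skb_n = 2;
	int moved = shift(tgt, &tgt_n, skb, &skb_n, 2172);

	/* Expect: 2172 bytes moved, tgt's first frag coalesced to 2896, a new
	 * 724-byte frag split off page 2, and 724 bytes left in the donor.
	 */
	printf("moved %d bytes; tgt has %d frags, donor has %d left\n",
	       moved, tgt_n, skb_n);
	return 0;
}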
* [PATCH 07/10] tcp: Make shifting not clear the hints 2008-11-24 14:22 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 08/10] tcp: add some mibs to track collapsing Ilpo Järvinen 2008-11-25 5:27 ` [PATCH 07/10] tcp: Make shifting not clear the hints David Miller 2008-11-25 5:20 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing David Miller 1 sibling, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo J�rvinen The earlier version was just very basic one which is "playing safe" by always clearing the hints. However, clearing of a hint is extremely costly operation with large windows, so it must be avoided at all cost whenever possible, there is a way with shifting too achieve not-clearing. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- net/ipv4/tcp_input.c | 16 ++++++++++++++-- 1 files changed, 14 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8106421..1f73cda 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1378,6 +1378,11 @@ static int tcp_shifted_skb(struct sock *sk, struct sk_buff *prev, BUG_ON(!pcount); + /* Tweak before seqno plays */ + if (!tcp_is_fack(tp) && tcp_is_sack(tp) && tp->lost_skb_hint && + !before(TCP_SKB_CB(tp->lost_skb_hint)->seq, TCP_SKB_CB(skb)->seq)) + tp->lost_cnt_hint += pcount; + TCP_SKB_CB(prev)->end_seq += shifted; TCP_SKB_CB(skb)->seq += shifted; @@ -1407,8 +1412,6 @@ static int tcp_shifted_skb(struct sock *sk, struct sk_buff *prev, /* Difference in this won't matter, both ACKed by the same cumul. ACK */ TCP_SKB_CB(prev)->sacked |= (TCP_SKB_CB(skb)->sacked & TCPCB_EVER_RETRANS); - tcp_clear_all_retrans_hints(tp); - if (skb->len > 0) { BUG_ON(!tcp_skb_pcount(skb)); return 0; @@ -1416,6 +1419,15 @@ static int tcp_shifted_skb(struct sock *sk, struct sk_buff *prev, /* Whole SKB was eaten :-) */ + if (skb == tp->retransmit_skb_hint) + tp->retransmit_skb_hint = prev; + if (skb == tp->scoreboard_skb_hint) + tp->scoreboard_skb_hint = prev; + if (skb == tp->lost_skb_hint) { + tp->lost_skb_hint = prev; + tp->lost_cnt_hint -= tcp_skb_pcount(prev); + } + TCP_SKB_CB(skb)->flags |= TCP_SKB_CB(prev)->flags; if (skb == tcp_highest_sack(sk)) tcp_advance_highest_sack(sk, skb); -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 08/10] tcp: add some mibs to track collapsing 2008-11-24 14:22 ` [PATCH 07/10] tcp: Make shifting not clear the hints Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) Ilpo Järvinen 2008-11-25 5:27 ` [PATCH 08/10] tcp: add some mibs to track collapsing David Miller 2008-11-25 5:27 ` [PATCH 07/10] tcp: Make shifting not clear the hints David Miller 1 sibling, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo J�rvinen Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- include/linux/snmp.h | 3 +++ net/ipv4/proc.c | 3 +++ net/ipv4/tcp_input.c | 4 ++++ 3 files changed, 10 insertions(+), 0 deletions(-) diff --git a/include/linux/snmp.h b/include/linux/snmp.h index 7a6e6bb..aee3f1e 100644 --- a/include/linux/snmp.h +++ b/include/linux/snmp.h @@ -216,6 +216,9 @@ enum LINUX_MIB_TCPSPURIOUSRTOS, /* TCPSpuriousRTOs */ LINUX_MIB_TCPMD5NOTFOUND, /* TCPMD5NotFound */ LINUX_MIB_TCPMD5UNEXPECTED, /* TCPMD5Unexpected */ + LINUX_MIB_SACKSHIFTED, + LINUX_MIB_SACKMERGED, + LINUX_MIB_SACKSHIFTFALLBACK, __LINUX_MIB_MAX }; diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index a631a1f..731789b 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -234,6 +234,9 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM("TCPSpuriousRTOs", LINUX_MIB_TCPSPURIOUSRTOS), SNMP_MIB_ITEM("TCPMD5NotFound", LINUX_MIB_TCPMD5NOTFOUND), SNMP_MIB_ITEM("TCPMD5Unexpected", LINUX_MIB_TCPMD5UNEXPECTED), + SNMP_MIB_ITEM("TCPSackShifted", LINUX_MIB_SACKSHIFTED), + SNMP_MIB_ITEM("TCPSackMerged", LINUX_MIB_SACKMERGED), + SNMP_MIB_ITEM("TCPSackShiftFallback", LINUX_MIB_SACKSHIFTFALLBACK), SNMP_MIB_SENTINEL }; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 1f73cda..d8235cf 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1414,6 +1414,7 @@ static int tcp_shifted_skb(struct sock *sk, struct sk_buff *prev, if (skb->len > 0) { BUG_ON(!tcp_skb_pcount(skb)); + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKSHIFTED); return 0; } @@ -1435,6 +1436,8 @@ static int tcp_shifted_skb(struct sock *sk, struct sk_buff *prev, tcp_unlink_write_queue(skb, sk); sk_wmem_free_skb(sk, skb); + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKMERGED); + return 1; } @@ -1593,6 +1596,7 @@ noop: return skb; fallback: + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKSHIFTFALLBACK); return NULL; } -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) 2008-11-24 14:22 ` [PATCH 08/10] tcp: add some mibs to track collapsing Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 10/10] tcp: handle shift/merge of cloned skbs too Ilpo Järvinen 2008-11-25 5:27 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) David Miller 2008-11-25 5:27 ` [PATCH 08/10] tcp: add some mibs to track collapsing David Miller 1 sibling, 2 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev Not for inclusion! For those who are interested enough on internals while maintaining high-speed. --- include/linux/snmp.h | 13 +++++++++++++ net/ipv4/proc.c | 13 +++++++++++++ net/ipv4/tcp_input.c | 47 ++++++++++++++++++++++++++++++++++++----------- 3 files changed, 62 insertions(+), 11 deletions(-) diff --git a/include/linux/snmp.h b/include/linux/snmp.h index aee3f1e..501186a 100644 --- a/include/linux/snmp.h +++ b/include/linux/snmp.h @@ -219,6 +219,19 @@ enum LINUX_MIB_SACKSHIFTED, LINUX_MIB_SACKMERGED, LINUX_MIB_SACKSHIFTFALLBACK, + LINUX_MIB_SACKCOLLAPSEFALLBACKGSO, + LINUX_MIB_SACKCOLLAPSEFALLBACKSKBBITS, + LINUX_MIB_SACKCOLLAPSEFALLBACKSKBDATA, + LINUX_MIB_SACKCOLLAPSEFALLBACKBELOW, + LINUX_MIB_SACKCOLLAPSEFALLBACKFIRST, + LINUX_MIB_SACKCOLLAPSEFALLBACKPREVBITS, + LINUX_MIB_SACKCOLLAPSEFALLBACKMSS, + LINUX_MIB_SACKCOLLAPSEFALLBACKNOHEAD, + LINUX_MIB_SACKCOLLAPSEFALLBACKSHIFT, + LINUX_MIB_SACKCOLLAPSENOOPSEQ, + LINUX_MIB_SACKCOLLAPSENOOPSMALLPCOUNT, + LINUX_MIB_SACKCOLLAPSENOOPSMALLLEN, + LINUX_MIB_SACKCOLLAPSEHOLE, __LINUX_MIB_MAX }; diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 731789b..1d37fcd 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -237,6 +237,19 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM("TCPSackShifted", LINUX_MIB_SACKSHIFTED), SNMP_MIB_ITEM("TCPSackMerged", LINUX_MIB_SACKMERGED), SNMP_MIB_ITEM("TCPSackShiftFallback", LINUX_MIB_SACKSHIFTFALLBACK), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKGSO", LINUX_MIB_SACKCOLLAPSEFALLBACKGSO), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKSKBBITS", LINUX_MIB_SACKCOLLAPSEFALLBACKSKBBITS), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKSKBDATA", LINUX_MIB_SACKCOLLAPSEFALLBACKSKBDATA), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKBELOW", LINUX_MIB_SACKCOLLAPSEFALLBACKBELOW), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKFIRST", LINUX_MIB_SACKCOLLAPSEFALLBACKFIRST), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKPREVBITS", LINUX_MIB_SACKCOLLAPSEFALLBACKPREVBITS), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKMSS", LINUX_MIB_SACKCOLLAPSEFALLBACKMSS), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKNOHEAD", LINUX_MIB_SACKCOLLAPSEFALLBACKNOHEAD), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEFALLBACKSHIFT", LINUX_MIB_SACKCOLLAPSEFALLBACKSHIFT), + SNMP_MIB_ITEM("TCPSACKCOLLAPSENOOPSEQ", LINUX_MIB_SACKCOLLAPSENOOPSEQ), + SNMP_MIB_ITEM("TCPSACKCOLLAPSENOOPSMALLPCOUNT", LINUX_MIB_SACKCOLLAPSENOOPSMALLPCOUNT), + SNMP_MIB_ITEM("TCPSACKCOLLAPSENOOPSMALLLEN", LINUX_MIB_SACKCOLLAPSENOOPSMALLLEN), + SNMP_MIB_ITEM("TCPSACKCOLLAPSEHOLE", LINUX_MIB_SACKCOLLAPSEHOLE), SNMP_MIB_SENTINEL }; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index d8235cf..ffa37f9 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1475,26 +1475,38 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb, int len; int in_sack; - if (!sk_can_gso(sk)) + if (!sk_can_gso(sk)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKGSO); goto fallback; + } /* Normally R 
but no L won't result in plain S */ if (!dup_sack && - (TCP_SKB_CB(skb)->sacked & TCPCB_TAGBITS) == TCPCB_SACKED_RETRANS) + (TCP_SKB_CB(skb)->sacked & TCPCB_TAGBITS) == TCPCB_SACKED_RETRANS) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKSKBBITS); goto fallback; - if (!skb_can_shift(skb)) + } + if (!skb_can_shift(skb)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKSKBDATA); goto fallback; + } /* This frame is about to be dropped (was ACKed). */ - if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) + if (!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKBELOW); goto fallback; + } /* Can only happen with delayed DSACK + discard craziness */ - if (unlikely(skb == tcp_write_queue_head(sk))) + if (unlikely(skb == tcp_write_queue_head(sk))) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKFIRST); goto fallback; + } prev = tcp_write_queue_prev(sk, skb); - if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED) + if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKPREVBITS); goto fallback; + } in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) && !before(end_seq, TCP_SKB_CB(skb)->end_seq); @@ -1507,17 +1519,23 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb, /* TODO: Fix DSACKs to not fragment already SACKed and we can * drop this restriction as unnecessary */ - if (mss != tcp_shift_mss(prev)) + if (mss != tcp_shift_mss(prev)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKMSS); goto fallback; + } } else { - if (!after(TCP_SKB_CB(skb)->end_seq, start_seq)) + if (!after(TCP_SKB_CB(skb)->end_seq, start_seq)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSENOOPSEQ); goto noop; + } /* CHECKME: This is non-MSS split case only?, this will * cause skipped skbs due to advancing loop btw, original * has that feature too */ - if (tcp_skb_pcount(skb) <= 1) + if (tcp_skb_pcount(skb) <= 1) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSENOOPSMALLPCOUNT); goto noop; + } in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq); if (!in_sack) { @@ -1532,6 +1550,7 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb, * it would make skb timestamp tracking considerably * harder problem. 
*/ + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKNOHEAD); goto fallback; } @@ -1548,12 +1567,15 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb, /* TODO: Fix DSACKs to not fragment already SACKed and we can * drop this restriction as unnecessary */ - if (mss != tcp_shift_mss(prev)) + if (mss != tcp_shift_mss(prev)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKMSS); goto fallback; + } if (len == mss) { pcount = 1; } else if (len < mss) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSENOOPSMALLLEN); goto noop; } else { pcount = len / mss; @@ -1561,8 +1583,10 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb, } } - if (!skb_shift(prev, skb, len)) + if (!skb_shift(prev, skb, len)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEFALLBACKSHIFT); goto fallback; + } if (!tcp_shifted_skb(sk, prev, skb, pcount, len, *fack_count, reord, flag, mss)) goto out; @@ -1583,6 +1607,7 @@ static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb, len = skb->len; if (skb_shift(prev, skb, len)) { + NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_SACKCOLLAPSEHOLE); pcount += tcp_skb_pcount(skb); tcp_shifted_skb(sk, prev, skb, tcp_skb_pcount(skb), len, *fack_count, reord, flag, mss); -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* [PATCH 10/10] tcp: handle shift/merge of cloned skbs too 2008-11-24 14:22 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) Ilpo Järvinen @ 2008-11-24 14:22 ` Ilpo Järvinen 2008-11-25 5:32 ` David Miller 2008-11-25 5:27 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) David Miller 1 sibling, 1 reply; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:22 UTC (permalink / raw) To: David Miller; +Cc: netdev, Ilpo Järvinen This caused me to repeatedly get: tcpdump: pcap_loop: recvfrom: Bad address Happens occasionally when I tcpdump my for-looped test xfers: while [ : ]; do echo -n "$(date '+%s.%N') "; ./sendfile; sleep 20; done Rest of the relevant commands: ethtool -K eth0 tso off tc qdisc add dev eth0 root netem drop 4% tcpdump -n -s0 -i eth0 -w sacklog.all Running net-next under kvm, the connection goes to the same host (basically just out of kvm). The connection itself works ok and data gets sent without corruption even with a large number of tests, while tcpdump usually fails within less than 5 tests. Whether it happens only because of this change, I don't know for sure, but it's the only thing with which I've seen that error; the non-cloned variant works without it for a much longer time. I have yet to debug where the error actually comes from. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- net/core/skbuff.c | 7 ++----- 1 files changed, 2 insertions(+), 5 deletions(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 844b8ab..57555a4 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -2018,13 +2018,10 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len) skb_split_no_header(skb, skb1, len, pos); } -/* Shifting from/to a cloned skb is a no-go. - * - * TODO: handle cloned skbs by using pskb_expand_head() - */ +/* Shifting from/to a cloned skb is a no-go. */ static int skb_prepare_for_shift(struct sk_buff *skb) { - return skb_cloned(skb); + return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC); } /** -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 10/10] tcp: handle shift/merge of cloned skbs too 2008-11-24 14:22 ` [PATCH 10/10] tcp: handle shift/merge of cloned skbs too Ilpo Järvinen @ 2008-11-25 5:32 ` David Miller 0 siblings, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:32 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:07 +0200 > This caused me to repeatedly get: > > tcpdump: pcap_loop: recvfrom: Bad address > > Happens occasionally when I tcpdump my for-looped test xfers: > while [ : ]; do echo -n "$(date '+%s.%N') "; ./sendfile; sleep 20; done > > Rest of the relevant commands: > ethtool -K eth0 tso off > tc qdisc add dev eth0 root netem drop 4% > tcpdump -n -s0 -i eth0 -w sacklog.all I'm applying this in any event. What could be happening is some bad clone handling elsewhere (AF_PACKET, for example): for some reason libpcap reads stale data, then performs a packet read using incorrect lengths, and this leads to reading past the end of its user buffer and we -EFAULT. Just capturing the -EFAULT'ing call arguments with strace would be enough to give some deeper clues. Do you have that? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) 2008-11-24 14:22 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 10/10] tcp: handle shift/merge of cloned skbs too Ilpo Järvinen @ 2008-11-25 5:27 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:27 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:06 +0200 > Not for inclusion! For those who are interested enough on internals > while maintaining high-speed. Skipped of course :) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 08/10] tcp: add some mibs to track collapsing 2008-11-24 14:22 ` [PATCH 08/10] tcp: add some mibs to track collapsing Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) Ilpo Järvinen @ 2008-11-25 5:27 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:27 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:05 +0200 > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Applied. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 07/10] tcp: Make shifting not clear the hints 2008-11-24 14:22 ` [PATCH 07/10] tcp: Make shifting not clear the hints Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 08/10] tcp: add some mibs to track collapsing Ilpo Järvinen @ 2008-11-25 5:27 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:27 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:04 +0200 > The earlier version was just a very basic one which is "playing > safe" by always clearing the hints. However, clearing a hint is > an extremely costly operation with large windows, so it must be > avoided at all costs whenever possible; with shifting there is a > way to achieve not clearing them. > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Applied. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing 2008-11-24 14:22 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 07/10] tcp: Make shifting not clear the hints Ilpo Järvinen @ 2008-11-25 5:20 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:20 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:03 +0200 > During SACK processing, most of the benefits of TSO are eaten by > the SACK blocks that one-by-one fragment SKBs into MSS-sized chunks. > Then we're in trouble when cleanup work for them has to be done > when a large cumulative ACK arrives. Try to return to the pre-split > state already while more and more SACK info gets discovered, by > combining newly discovered SACK areas with the previous skb if > that one is SACKed as well. ... > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Really nice work, patch applied, thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
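The approach described in the quoted changelog can be sketched roughly like this (a hypothetical illustration of the idea, not the code that was merged; "prev" and the bookkeeping shown are simplified): when a newly reported SACK block covers an skb whose predecessor on the write queue is already SACKed, the payload is shifted into that predecessor with the new skb_shift() helper instead of leaving two MSS-sized fragments behind.

    /* Simplified, hypothetical sketch of the combining step in the sacktag walk. */
    prev = tcp_write_queue_prev(sk, skb);
    if ((TCP_SKB_CB(prev)->sacked & TCPCB_SACKED_ACKED) &&
        skb_shift(prev, skb, skb->len)) {
            /* All payload now lives in prev: extend its sequence range,
             * move pcount/sacked accounting, then drop the emptied skb.
             */
            TCP_SKB_CB(prev)->end_seq = TCP_SKB_CB(skb)->end_seq;
            tcp_unlink_write_queue(skb, sk);
            sk_wmem_free_skb(sk, skb);
    }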
* Re: [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too 2008-11-24 14:22 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing Ilpo Järvinen @ 2008-11-25 5:15 ` David Miller 2008-11-25 13:44 ` Ilpo Järvinen 1 sibling, 1 reply; 28+ messages in thread From: David Miller @ 2008-11-25 5:15 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:02 +0200 > This is preparatory work for the SACK combiner patch, which may > have to count TCP state changes for only a part of the skb > because it will intentionally avoid splitting the skb into SACKed > and not-SACKed parts. > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Applied. Although we desperately need to simplify this function at some point. Anything past 6 arguments and you're going onto the stack on just about every cpu. ^ permalink raw reply [flat|nested] 28+ messages in thread
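A common remedy for the argument-count concern raised above (shown purely as an illustration; the struct name and fields below are invented here, not taken from this thread) is to bundle the per-walk state into one struct and pass a pointer to it, so the call stays within the register-passed arguments:

    /* Hypothetical sketch: fold tcp_sacktag_one()'s long parameter list
     * into a single state struct passed by pointer.
     */
    struct sacktag_walk_state {
            u32     start_seq;      /* SACK block being processed */
            u32     end_seq;
            int     dup_sack;
            int     pcount;         /* how many segments this skb covers */
            int     reord;          /* reordering detected so far */
            int     flag;           /* accumulated FLAG_* bits */
    };

    static u8 sacktag_one(struct sock *sk, struct sk_buff *skb,
                          struct sacktag_walk_state *state);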
* Re: [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too 2008-11-25 5:15 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too David Miller @ 2008-11-25 13:44 ` Ilpo Järvinen 0 siblings, 0 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-25 13:44 UTC (permalink / raw) To: David Miller; +Cc: Netdev [-- Attachment #1: Type: TEXT/PLAIN, Size: 897 bytes --] On Mon, 24 Nov 2008, David Miller wrote: > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> > Date: Mon, 24 Nov 2008 16:22:02 +0200 > > > This is preparatory work for the SACK combiner patch, which may > > have to count TCP state changes for only a part of the skb > > because it will intentionally avoid splitting the skb into SACKed > > and not-SACKed parts. > > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> > > Applied. > > Although we desperately need to simplify this function at > some point. Anything past 6 arguments and you're going onto > the stack on just about every cpu. Sure, I made a note about that in the combine change which makes things much worse yet :-), which you obviously didn't read... ;-) I certainly was going to do that, but it didn't seem useful to waste too much time on it early on, until I had a well-working solution to the actual problem :-). -- i. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries 2008-11-24 14:22 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too Ilpo Järvinen @ 2008-11-25 5:13 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:13 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:01 +0200 > Sadly enough, this adds a possible divide, though we try to avoid > it by checking one mss as the common case. > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Applied. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 03/10] tcp: more aggressive skipping 2008-11-24 14:22 ` [PATCH 03/10] tcp: more aggressive skipping Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries Ilpo Järvinen @ 2008-11-25 5:12 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:12 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:22:00 +0200 > I knew already when rewriting the sacktag code that this condition > was too conservative; change it now since it prevents a lot of > useless work (especially in the sack shifter decision code > that is being added by a later patch). This shouldn't change > anything really, it just saves some processing regardless of the > shifter. > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Applied. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-24 14:21 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 03/10] tcp: more aggressive skipping Ilpo Järvinen @ 2008-11-24 14:50 ` Ilpo Järvinen 2008-11-24 16:36 ` Andi Kleen 2008-11-25 5:10 ` David Miller 3 siblings, 0 replies; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 14:50 UTC (permalink / raw) To: David Miller; +Cc: Netdev [-- Attachment #1: Type: TEXT/PLAIN, Size: 5448 bytes --] On Mon, 24 Nov 2008, Ilpo Järvinen wrote: > --- a/include/net/tcp.h > +++ b/include/net/tcp.h > @@ -472,8 +472,6 @@ extern void tcp_send_delayed_ack(struct sock *sk); > > /* tcp_input.c */ > extern void tcp_cwnd_application_limited(struct sock *sk); > -extern void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, > - struct sk_buff *skb); > > /* tcp_timer.c */ > extern void tcp_init_xmit_timers(struct sock *); Doh, already noticed I'm missing the to-static counterpart of this. A fixed patch below... -- i. [PATCHv2 02/10] tcp: move tcp_simple_retransmit to tcp_input Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- include/net/tcp.h | 2 - net/ipv4/tcp_input.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++- net/ipv4/tcp_output.c | 50 ---------------------------------------------- 3 files changed, 52 insertions(+), 53 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 8f26b28..90b4c3b 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -472,8 +472,6 @@ extern void tcp_send_delayed_ack(struct sock *sk); /* tcp_input.c */ extern void tcp_cwnd_application_limited(struct sock *sk); -extern void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, - struct sk_buff *skb); /* tcp_timer.c */ extern void tcp_init_xmit_timers(struct sock *); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 097294b..8085704 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1002,7 +1002,8 @@ static void tcp_skb_mark_lost(struct tcp_sock *tp, struct sk_buff *skb) } } -void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb) +static void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, + struct sk_buff *skb) { tcp_verify_retransmit_hint(tp, skb); @@ -2559,6 +2560,56 @@ static void tcp_mtup_probe_success(struct sock *sk, struct sk_buff *skb) tcp_sync_mss(sk, icsk->icsk_pmtu_cookie); } +/* Do a simple retransmit without using the backoff mechanisms in + * tcp_timer. This is used for path mtu discovery. + * The socket is already locked here. + */ +void tcp_simple_retransmit(struct sock *sk) +{ + const struct inet_connection_sock *icsk = inet_csk(sk); + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; + unsigned int mss = tcp_current_mss(sk, 0); + u32 prior_lost = tp->lost_out; + + tcp_for_write_queue(skb, sk) { + if (skb == tcp_send_head(sk)) + break; + if (skb->len > mss && + !(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) { + if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) { + TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; + tp->retrans_out -= tcp_skb_pcount(skb); + } + tcp_skb_mark_lost_uncond_verify(tp, skb); + } + } + + tcp_clear_retrans_hints_partial(tp); + + if (prior_lost == tp->lost_out) + return; + + if (tcp_is_reno(tp)) + tcp_limit_reno_sacked(tp); + + tcp_verify_left_out(tp); + + /* Don't muck with the congestion window here. + * Reason is that we do not increase amount of _data_ + * in network, but units changed and effective + * cwnd/ssthresh really reduced now. 
+ */ + if (icsk->icsk_ca_state != TCP_CA_Loss) { + tp->high_seq = tp->snd_nxt; + tp->snd_ssthresh = tcp_current_ssthresh(sk); + tp->prior_ssthresh = 0; + tp->undo_marker = 0; + tcp_set_ca_state(sk, TCP_CA_Loss); + } + tcp_xmit_retransmit_queue(sk); +} + /* Process an event, which can update packets-in-flight not trivially. * Main goal of this function is to calculate new estimate for left_out, * taking into account both packets sitting in receiver's buffer and diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 86ef989..c069ecb 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1879,56 +1879,6 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to, } } -/* Do a simple retransmit without using the backoff mechanisms in - * tcp_timer. This is used for path mtu discovery. - * The socket is already locked here. - */ -void tcp_simple_retransmit(struct sock *sk) -{ - const struct inet_connection_sock *icsk = inet_csk(sk); - struct tcp_sock *tp = tcp_sk(sk); - struct sk_buff *skb; - unsigned int mss = tcp_current_mss(sk, 0); - u32 prior_lost = tp->lost_out; - - tcp_for_write_queue(skb, sk) { - if (skb == tcp_send_head(sk)) - break; - if (skb->len > mss && - !(TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) { - if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) { - TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS; - tp->retrans_out -= tcp_skb_pcount(skb); - } - tcp_skb_mark_lost_uncond_verify(tp, skb); - } - } - - tcp_clear_retrans_hints_partial(tp); - - if (prior_lost == tp->lost_out) - return; - - if (tcp_is_reno(tp)) - tcp_limit_reno_sacked(tp); - - tcp_verify_left_out(tp); - - /* Don't muck with the congestion window here. - * Reason is that we do not increase amount of _data_ - * in network, but units changed and effective - * cwnd/ssthresh really reduced now. - */ - if (icsk->icsk_ca_state != TCP_CA_Loss) { - tp->high_seq = tp->snd_nxt; - tp->snd_ssthresh = tcp_current_ssthresh(sk); - tp->prior_ssthresh = 0; - tp->undo_marker = 0; - tcp_set_ca_state(sk, TCP_CA_Loss); - } - tcp_xmit_retransmit_queue(sk); -} - /* This retransmits one SKB. Policy decisions and retransmit queue * state updates are done by the caller. Returns non-zero if an * error occurred which prevented the send. -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-24 14:21 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 03/10] tcp: more aggressive skipping Ilpo Järvinen 2008-11-24 14:50 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen @ 2008-11-24 16:36 ` Andi Kleen 2008-11-24 16:58 ` Ilpo Järvinen 2008-11-25 5:10 ` David Miller 3 siblings, 1 reply; 28+ messages in thread From: Andi Kleen @ 2008-11-24 16:36 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: David Miller, netdev "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> writes: tcp: move tcp_simple_retransmit to tcp_input Why? It seems like a clear output function. -Andi -- ak@linux.intel.com ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-24 16:36 ` Andi Kleen @ 2008-11-24 16:58 ` Ilpo Järvinen 2008-11-24 19:07 ` Andi Kleen 0 siblings, 1 reply; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-24 16:58 UTC (permalink / raw) To: Andi Kleen; +Cc: David Miller, Netdev [-- Attachment #1: Type: TEXT/PLAIN, Size: 786 bytes --] On Mon, 24 Nov 2008, Andi Kleen wrote: > "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> writes: > > tcp: move tcp_simple_retransmit to tcp_input > > Why? It seems like a clear output function. What makes you think so? It's as much output as e.g. tcp_enter_loss (or in fact less than that, since enter_loss is triggered mainly by the RTO)? Besides, it's an action based on input; the name is just a misnomer as it retransmits nothing except by calling tcp_xmit_retransmit_queue. Every state change it does is similar to what is being done elsewhere in tcp_input.c (I won't waste my time on listing them here :-)). Not that it's an extremely big deal, but doing some wq processing related changes has forced me to go into that other file just because of that particular function. -- i. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-24 16:58 ` Ilpo Järvinen @ 2008-11-24 19:07 ` Andi Kleen 0 siblings, 0 replies; 28+ messages in thread From: Andi Kleen @ 2008-11-24 19:07 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Andi Kleen, David Miller, Netdev On Mon, Nov 24, 2008 at 06:58:25PM +0200, Ilpo Järvinen wrote: > On Mon, 24 Nov 2008, Andi Kleen wrote: > > > "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> writes: > > > > tcp: move tcp_simple_retransmit to tcp_input > > > > Why? It seems like a clear output function. > > What makes you think so? It's as much output as e.g., tcp_enter_loss the path mtu code calls it to output a packet. > (or in fact less than that since enter_loss is triggered mainly by rto)? enter_loss doesn't send a packet. -Andi ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-24 14:21 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen ` (2 preceding siblings ...) 2008-11-24 16:36 ` Andi Kleen @ 2008-11-25 5:10 ` David Miller 2008-11-25 14:02 ` Ilpo Järvinen 3 siblings, 1 reply; 28+ messages in thread From: David Miller @ 2008-11-25 5:10 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:21:59 +0200 > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> I applied this with a minor modification. tcp_skb_mark_lost_uncond_verify() can now be marked static, so I made that change. BTW, that would have generated a sparse warning had you run sparse on your changes :) ^ permalink raw reply [flat|nested] 28+ messages in thread
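For reference, the sparse run being referred to goes through kbuild's C= option; the exact targets below are only examples, not commands taken from this thread:

    make C=1 net/ipv4/              # run sparse on files as they get rebuilt
    make C=2 net/ipv4/tcp_input.o   # force a sparse check of this one object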
* Re: [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-25 5:10 ` David Miller @ 2008-11-25 14:02 ` Ilpo Järvinen 2008-11-25 21:45 ` David Miller 0 siblings, 1 reply; 28+ messages in thread From: Ilpo Järvinen @ 2008-11-25 14:02 UTC (permalink / raw) To: David Miller; +Cc: Netdev [-- Attachment #1: Type: TEXT/PLAIN, Size: 1888 bytes --] On Mon, 24 Nov 2008, David Miller wrote: > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> > Date: Mon, 24 Nov 2008 16:21:59 +0200 > > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> > > I applied this with a minor modification. > > tcp_skb_mark_lost_uncond_verify() can now be marked static, so I made > that change. > > BTW, that would have generated a sparse warning had you run sparse on > your changes :) I did use sparse, that is, when I finally remembered :-), but the results didn't make it into the series I sent; I followed up with a v2 of this particular patch. ...I also realized that another function can be made static as a result of this move, patch below. -- i. -- [PATCH] tcp: tcp_limit_reno_sacked can become static Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> --- include/net/tcp.h | 2 -- net/ipv4/tcp_input.c | 2 +- 2 files changed, 1 insertions(+), 3 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 2653924..e8ae90a 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -761,8 +761,6 @@ static inline unsigned int tcp_packets_in_flight(const struct tcp_sock *tp) return tp->packets_out - tcp_left_out(tp) + tp->retrans_out; } -extern int tcp_limit_reno_sacked(struct tcp_sock *tp); - /* If cwnd > ssthresh, we may raise ssthresh to be half-way to cwnd. * The exception is rate halving phase, when cwnd is decreasing towards * ssthresh. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9f8a80b..d67b6e9 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1940,7 +1940,7 @@ out: /* Limits sacked_out so that sum with lost_out isn't ever larger than * packets_out. Returns zero if sacked_out adjustement wasn't necessary. */ -int tcp_limit_reno_sacked(struct tcp_sock *tp) +static int tcp_limit_reno_sacked(struct tcp_sock *tp) { u32 holes; -- 1.5.2.2 ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input 2008-11-25 14:02 ` Ilpo Järvinen @ 2008-11-25 21:45 ` David Miller 0 siblings, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 21:45 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Tue, 25 Nov 2008 16:02:22 +0200 (EET) > [PATCH] tcp: tcp_limit_reno_sacked can become static > > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Applied, thanks Ilpo. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH 01/10] tcp: collapse more than two on retransmission 2008-11-24 14:21 ` [PATCH 01/10] tcp: collapse more than two on retransmission Ilpo Järvinen 2008-11-24 14:21 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen @ 2008-11-25 5:05 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2008-11-25 5:05 UTC (permalink / raw) To: ilpo.jarvinen; +Cc: netdev From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Mon, 24 Nov 2008 16:21:58 +0200 > I always had thought that collapsing up to two at a time was > intentional decision to avoid excessive processing if 1 byte > sized skbs are to be combined for a full mtu, and consecutive > retransmissions would make the size of the retransmittee > double each round anyway, but some recent discussion made me > to understand that was not the case. Thus make collapse work > more and wait less. > > It would be possible to take advantage of the shifting > machinery (added in the later patch) in the case of paged > data but that can be implemented on top of this change. Yes, the ->nr_frags test is a real limiter these days because on any modern device all TCP data is paged, whether sendmsg() or sendpage() generated. So you must have used a non-SG capable device for your tests or simply turned SG off using ethtool :) ... > Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Applied, thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
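For completeness, "turned SG off using ethtool" refers to the same kind of invocation already used for TSO earlier in this thread; eth0 is just the example interface:

    ethtool -K eth0 sg off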
end of thread, other threads:[~2008-11-25 21:45 UTC | newest] Thread overview: 28+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2008-11-24 14:21 [PATCH 0/n]: 2 is better than 1 - tcp recombining; both with SACK and rexmits Ilpo Järvinen 2008-11-24 14:21 ` [PATCH 01/10] tcp: collapse more than two on retransmission Ilpo Järvinen 2008-11-24 14:21 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 03/10] tcp: more aggressive skipping Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 07/10] tcp: Make shifting not clear the hints Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 08/10] tcp: add some mibs to track collapsing Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) Ilpo Järvinen 2008-11-24 14:22 ` [PATCH 10/10] tcp: handle shift/merge of cloned skbs too Ilpo Järvinen 2008-11-25 5:32 ` David Miller 2008-11-25 5:27 ` [PATCH 09/10] tcp: more accurate fallback counters (DEBUGONLY) David Miller 2008-11-25 5:27 ` [PATCH 08/10] tcp: add some mibs to track collapsing David Miller 2008-11-25 5:27 ` [PATCH 07/10] tcp: Make shifting not clear the hints David Miller 2008-11-25 5:20 ` [PATCH 06/10] tcp: Try to restore large SKBs while SACK processing David Miller 2008-11-25 5:15 ` [PATCH 05/10] tcp: make tcp_sacktag_one able to handle partial skb too David Miller 2008-11-25 13:44 ` Ilpo Järvinen 2008-11-25 5:13 ` [PATCH 04/10] tcp: Make SACK code to split only at mss boundaries David Miller 2008-11-25 5:12 ` [PATCH 03/10] tcp: more aggressive skipping David Miller 2008-11-24 14:50 ` [PATCH 02/10] tcp: move tcp_simple_retransmit to tcp_input Ilpo Järvinen 2008-11-24 16:36 ` Andi Kleen 2008-11-24 16:58 ` Ilpo Järvinen 2008-11-24 19:07 ` Andi Kleen 2008-11-25 5:10 ` David Miller 2008-11-25 14:02 ` Ilpo Järvinen 2008-11-25 21:45 ` David Miller 2008-11-25 5:05 ` [PATCH 01/10] tcp: collapse more than two on retransmission David Miller